<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ELECTRONIC CORPORA: AS POWERFUL TOOLS IN COMPUTATIONAL LINGUISTIC ANALYSES</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Mohamed</forename><surname>Grazib</surname></persName>
							<email>mfgrazib@hotmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">Djillali Liabes University</orgName>
								<address>
									<settlement>Sidi Bel Abbes</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ELECTRONIC CORPORA: AS POWERFUL TOOLS IN COMPUTATIONAL LINGUISTIC ANALYSES</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B282AC15CA523280BD94552E0D216965</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>corpus</term>
					<term>computational linguistics</term>
					<term>corpus linguistics</term>
					<term>Concordances</term>
					<term>collocations and frequencies</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Technology has entered almost every domain of our daily life. In computational linguistics, electronic corpora play a very important role. It is now possible to study linguistic phenomena through statistical analyses: concordances, collocations and frequencies have greatly helped to make linguistic research more accessible, more adequate and more accurate.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">What is computational linguistics?</head><p>The Association for Computational Linguistics defines computational linguistics as the scientific study of language from a computational perspective. Computational linguistics is a discipline between linguistics and computer science. It is part of the cognitive sciences and has a strong relation with artificial intelligence. Computational linguistics originated in the 1950s, when the United States used computers to translate texts automatically from foreign languages into English, particularly Russian scientific journals. Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of natural language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">What is Corpus linguistics?</head><p>Corpus linguistics is the study and analysis of data obtained from a corpus. The main task of the corpus linguist is not only to find the data but also to analyse it. Computers are useful, and sometimes indispensable, tools in this process. Corpus linguistics relies on two main components: a corpus, which is the body of data to be investigated, and a concordancer, a tool for searching that corpus. Corpus linguistics is now seen as the study of linguistic phenomena through large collections of machine-readable texts: corpora. Biber et al. (1998:23) state that "Corpus linguistics makes it possible to identify the meanings of words by looking at their occurrences in natural contexts, rather than relying on intuitions about how a word is used or on incomplete citation collections".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Size of corpora:</head><p>Corpora come in many shapes and sizes because they are built to serve different purposes. Nowadays, 1 million words is fairly small for a corpus. We can distinguish between reference corpora<ref type="foot">1</ref> and monitor corpora<ref type="foot">2</ref>. The following list shows a very limited sample of corpus sizes.</p><p>• Bank of English: about 400 million words.</p><p>• COBUILD/Birmingham Corpus: more than 200 million words.</p><p>• Longman Lancaster corpus: 30 million words.</p><p>• British National Corpus (BNC): 100 million words.</p><p>• American National Corpus (ANC): 11.5 million words.</p><p>• Brown corpus: 1 million words.</p><p>• Lancaster-Oslo/Bergen (LOB) corpus: 1 million words.</p><p>• Northern Ireland Transcribed Corpus: 400,000 words.</p><p>• Corpus of Spoken American English (CSAE): 200,000 words.</p><p>What is evident is that the size of any corpus depends mainly on the purposes it was created for, and that sizes can range from a few hundred thousand words to several hundred million words.</p><note place="foot" n="1">Reference corpora have a fixed size (e.g., the British National Corpus).</note><note place="foot" n="2">Monitor corpora are expandable (e.g., the Bank of English).</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Concordance, Collocation and Frequency.</head><p>In reality, a corpus by itself can do nothing at all; it is nothing other than a store of used language. A corpus does not contain new information about language, but the software offers us new perspectives. Most readily available software packages process data from a corpus in three ways: showing frequency, phraseology and collocation (G. <ref type="bibr">Cook 2003:111</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Concordance:</head><p>A concordance is a screen display or printout of a chosen word or phrase in its different contexts, with that word or phrase arranged down the centre of the display along with the text that comes before and after it. John Sinclair (1991:32) defines a concordance as a collection of the occurrences of a word-form, each in its own textual environment. In the same vein, S. <ref type="bibr">Hunston (2002:39)</ref> describes a concordancer as a programme that searches a corpus for a selected word or phrase and presents every instance of that word or phrase in the centre of the computer screen, with the words that come before and after it to the left and right. The selected word appearing in the centre of the screen is known as the "node word".</p><p>The following example shows ten concordance lines for the word computer from the LOB corpus (Web Concordancer, LOB.txt).</p><p>1 etition between the analogue computer and the digital computer. To a</p><p>2 g made on a Ferranti Mercury Computer at Meteorological Office, Duns</p><p>3 unnecessary devices that the computer can be made an economic propos</p><p>4 ouch with manufacturers about computer developments of special signi</p><p>5 he {0PIW} are compiled by the computer from data sheets (dictionary</p><p>6 seen that the problem of the computer is in no way related to the p</p><p>7 racy is required the digital computer is the only one to use and ele</p><p>8 he {0ACE} digital electronic computer of our laboratory and, further</p><p>9 with the help of the digital computer of the University of Toronto.</p><p>10 al purpose electronic digital computer to do the job. It is therefore</p></div>
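<div xmlns="http://www.tei-c.org/ns/1.0"><p>The way a concordancer arranges its output can be illustrated with a minimal Python sketch. This is not the implementation of the Web Concordancer; the function name concordance, the width parameter and the sample sentences are assumptions made purely for illustration. The sketch finds each occurrence of the node word and pads the left and right context so that the node always appears in the same column.</p>
<p>
```python
import re


def concordance(text, node, width=30):
    """Return key-word-in-context lines with the node word in a fixed column.

    `width` is the number of characters of context kept on each side
    (a hypothetical parameter chosen for this sketch).
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start, end = match.span()
        # Pad both sides so that every node word lines up vertically.
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        lines.append(left + " " + match.group(0) + " " + right)
    return lines


# Invented sample text, not taken from the LOB corpus.
sample = ("The digital computer is the only one to use. "
          "An analogue computer can be made an economic proposition.")

for line in concordance(sample, "computer", width=20):
    print(line)
```
</p>
<p>As in the LOB example, the context is cut at a fixed number of characters, which is why words at the edges of a concordance line are often truncated.</p></div>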
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Collocation:</head><p>Firth (1957) stated that "you shall know a word by the company it keeps". The point of Firth's statement is that words should be classified not only on the basis of their meanings, but also on the basis of their co-occurrence with other words. S. <ref type="bibr">Hunston (2002:12)</ref> defines collocation as the statistical tendency of words to co-occur.</p><p>Collocation investigations can be a preliminary step for other research questions, such as investigating the distribution of word senses and uses, or comparing the use of seemingly synonymous words. Languages have many words that are similar, and dictionary definitions often characterise such words as identical or synonymous in meaning. Investigating the use and distribution of synonyms in a corpus, however, allows us to determine their contextual preferences, whether these are associated with other collocates or with register differences.</p></div>
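<div xmlns="http://www.tei-c.org/ns/1.0"><p>A simple way to approximate collocation is to count the words that occur within a few tokens of a node word. The Python sketch below does only that; the function name collocates, the span parameter, the tokenisation rule and the sample sentences are assumptions for illustration. It returns raw co-occurrence counts, which is weaker than the statistical tendency described by Hunston: a fuller analysis would also compare these counts with the overall frequency of each word.</p>
<p>
```python
import re
from collections import Counter


def collocates(text, node, span=4):
    """Count the words found within `span` tokens of each occurrence of `node`."""
    # Deliberately simple tokenisation: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Invented sample text for illustration only.
sample = "It was sheer hard work. By sheer hard work she won. Pure white snow fell."

print(collocates(sample, "sheer", span=1).most_common(3))
```
</p>
<p>With span=1 only the immediate neighbours of the node are counted; a wider span captures looser associations at the cost of more noise.</p></div>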
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Using electronic corpora in computational linguistics:</head><p>Many words have meanings that are similar, and yet the words cannot be substituted one for the other. Dictionaries, which deal with words separately rather than comparatively, can be of little help, but observing typical usages of near synonyms can clarify differences in meaning (S. <ref type="bibr">Hunston 2002:45</ref>). This study concerns the following synonyms: sheer, pure, complete, utter and absolute. The first analysis follows the traditional approach of consulting dictionaries. In this context <ref type="bibr">Partington (1998:33-46)</ref> gives examples of intensifying adjectives: "sheer, pure, complete, utter, and absolute". He points out that dictionaries tend to define these words in similar ways, and even give them as synonyms of each other:</p><p>- The Collins COBUILD English Dictionary (CCED) suggests that "complete" and "pure" are synonyms of "sheer".</p><p>- The Longman Dictionary of Contemporary English (LDOCE) gives "pure" as a synonym of "sheer".</p><p>- The earlier Collins COBUILD English Language Dictionary (CCELD) gives "absolute" as a superordinate of "sheer".</p><p>In spite of this apparent similarity in meaning, the typical collocates of each adjective differ considerably. For example, "sheer" is used with nouns of degree or magnitude (sheer weight, sheer number), often in the pattern (the sheer + noun + of + noun), e.g. (the sheer weight of noise). The other adjectives do not collocate with these nouns. In addition, "sheer" alone is often used in expressions indicating causality (through sheer insistence; by sheer hard work; because of sheer hard work; his sheer integrity got him through; his enthusiasm and sheer hard work meant that things moved quickly) (Partington 1998:36).</p><p>He ends this analysis with some observations. "Complete" is used with nouns indicating:</p><p>- Absence (complete ban)</p><p>- Change (complete revamping)</p><p>- Destruction (complete collapse)</p><p>"Absolute" is used with what Partington calls "hyperbolic" nouns, such as chaos, disgrace and genius (ibid. 1998:43).</p><p>By using corpora we can see immediately that complete is the most frequent of the five adjectives (12594 occurrences), followed by absolute (3432); pure comes third (3305) and sheer fourth (2028), while utter comes last with a frequency of only 652.</p></div>
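<div xmlns="http://www.tei-c.org/ns/1.0"><p>Raw counts such as these are usually normalised to a rate per million words so that they can be compared across corpora or registers of different sizes. The Python sketch below uses the five counts given in the text and the corpus size of 100,000,000 words stated in the frequency table; the function and variable names are assumptions for illustration.</p>
<p>
```python
# Corpus size as reported in the frequency table (100,000,000 words).
CORPUS_SIZE = 100_000_000

# Raw counts of the five adjectives as given in the text.
raw_counts = {
    "complete": 12594,
    "absolute": 3432,
    "pure": 3305,
    "sheer": 2028,
    "utter": 652,
}


def per_million(count, corpus_size=CORPUS_SIZE):
    """Normalise a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000


# Print the adjectives from most to least frequent.
ranked = sorted(raw_counts.items(), key=lambda item: item[1], reverse=True)
for word, count in ranked:
    print(f"{word:10s}{count:8d}{per_million(count):10.2f}")
```
</p>
<p>The same function can be applied to a single register by passing that register's word count as corpus_size.</p></div>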
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.2.">By register analyses</head><p>The following five tables show the frequencies of the adjectives concerned, by register.</p><p>Analysing the collocation table, we notice immediately that white (104), new (18) and public (12) are the words that most frequently collocate with the adjective pure, and that new (29) collocates most frequently with complete. The most frequent collocate of absolute is best (11); the adjective utter, however, does not appear at all among the top 20 collocations of the adjectives listed above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">CONCLUSION:</head><p>Corpus linguistics has developed considerably in recent decades thanks to the possibilities offered by natural language processing with computers. The availability of computers and machine-readable texts has made it possible to obtain data quickly and easily. Linguistic domains are now investigated with the help of computers, and the results are striking when compared with those of traditional research methods.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Footnotes</head><label>1, 2</label><figDesc>1. Includes wh-words, foreign words, numerals, etc. 2. The analyses are from the Brown Corpus of American English (Francis &amp; Kucera 1982:547) and the Lancaster-Oslo/Bergen (LOB) Corpus of written British English (Johansson &amp; Hofland 1989:15).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Collocations of the word time with nouns, verbs, and adjectives, from view.byu.edu. The table shows the five most frequent collocates of time in each word class.</figDesc><table><row><cell>Nouns</cell><cell></cell><cell>Verbs</cell><cell></cell><cell>Adjectives</cell><cell></cell></row><row><cell>WORD</cell><cell># TIMES NEARBY</cell><cell>WORD</cell><cell># TIMES NEARBY</cell><cell>WORD</cell><cell># TIMES NEARBY</cell></row><row><cell>1 YEARS</cell><cell>1933</cell><cell>WAS</cell><cell>17846</cell><cell>LONG</cell><cell>4850</cell></row><row><cell>2 YEAR</cell><cell>1703</cell><cell>IS</cell><cell>12614</cell><cell>GOOD</cell><cell>1587</cell></row><row><cell>3 PERIOD</cell><cell>1360</cell><cell>HAD</cell><cell>8128</cell><cell>SHORT</cell><cell>1522</cell></row><row><cell>4 PEOPLE</cell><cell>1334</cell><cell>BE</cell><cell>8023</cell><cell>OTHER</cell><cell>1202</cell></row><row><cell>5 DAY</cell><cell>1139</cell><cell>WERE</cell><cell>4298</cell><cell>RIGHT</cell><cell>1111</cell></row></table><note>Biber et al. (1998:24).</note></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Frequency:</head><p>A frequency list tells us which words and phrases are used most often. Biber et al. (1998:23) argue that frequency investigations tell us how often different words are used, allowing us to identify particularly common and uncommon words. Based on the evidence of the billion-word Oxford English Corpus, the 100 commonest English words found in writing around the world are as follows:</p><p>Table 2: The 100 most frequent English words.</p><p>As seen in the table above, many of the most frequently used words are grammatical words (articles, auxiliaries, prepositions, etc.); the first noun (time) appears only in 55th position.</p><p>We can also explore frequencies according to the main word classes. The frequencies of the main word classes in 1-million-word computer corpora of written English are given in the table below:</p></div>
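<div xmlns="http://www.tei-c.org/ns/1.0"><p>A frequency list of this kind can be produced in a few lines of Python. The sketch below is an illustration under simple assumptions: the function name frequency_list, the tokenisation rule and the sample sentence are invented here and are not how the Oxford English Corpus figures were computed.</p>
<p>
```python
import re
from collections import Counter


def frequency_list(text, top=5):
    """Return the `top` most frequent word forms with their counts."""
    # Deliberately simple tokenisation: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)


# Invented sample sentence for illustration only.
sample = "The time of the year is the time of the harvest."

for word, count in frequency_list(sample, top=3):
    print(word, count)
```
</p>
<p>Even in this tiny sample a grammatical word (the) heads the list, which mirrors the observation that grammatical words dominate the top of large frequency lists.</p></div>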
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3 :</head><label>3</label><figDesc>The frequencies of the main word classes</figDesc><table /><note>A brief analysis<ref type="foot">2</ref> of the information in the table above shows that: - Nouns are the most frequently used words. - Verbs are more frequent in conversation and in fiction. - Pronouns are more frequent in spoken English and in fiction than in informative writing. - Conjunctions have almost the same frequency in both corpora.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>7.1. The corpora analyses. 7.1.1. By frequency analyses</head><label></label><figDesc>The table below shows the frequencies of the adjectives (complete, absolute, pure, sheer, and utter):</figDesc><table><row><cell>DISTRIB</cell><cell>WORD/PHRASE</cell><cell>TOKENS REG1</cell><cell>PER MIL IN REG1 [100,000,000 WORDS]</cell></row><row><cell>1</cell><cell>COMPLETE</cell><cell>12594</cell><cell>125.94</cell></row><row><cell>1</cell><cell>ABSOLUTE</cell><cell>3432</cell><cell>34.32</cell></row><row><cell>1</cell><cell>PURE</cell><cell>3305</cell><cell>33.05</cell></row><row><cell>1</cell><cell>SHEER</cell><cell>2028</cell><cell>20.28</cell></row><row><cell>1</cell><cell>UTTER</cell><cell>652</cell><cell>6.52</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Frequencies from (view.byu.edu )</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>7.1.3. By collocation analyses.</head><label></label><figDesc>The following table shows the adjectives with their top 20 most frequent collocations.</figDesc><table><row><cell>DISTRIB</cell><cell>WORD/PHRASE</cell><cell>TOKENS REG1</cell></row><row><cell>1</cell><cell>PURE WHITE</cell><cell>104</cell></row><row><cell>2</cell><cell>COMPLETE NEW</cell><cell>29</cell></row><row><cell>3</cell><cell>SHEER HARD</cell><cell>25</cell></row><row><cell>4</cell><cell>PURE NEW</cell><cell>18</cell></row><row><cell>5</cell><cell>COMPLETE UNIFIED</cell><cell>17</cell></row><row><cell>6</cell><cell>SHEER PHYSICAL</cell><cell>17</cell></row><row><cell>7</cell><cell>PURE PUBLIC</cell><cell>12</cell></row><row><cell>8</cell><cell>ABSOLUTE BEST</cell><cell>11</cell></row><row><cell>9</cell><cell>COMPLETE SUPRACONAL</cell><cell>11</cell></row><row><cell>10</cell><cell>COMPLETE PHYSICAL</cell><cell>10</cell></row><row><cell>11</cell><cell>COMPLETE POLITICAL</cell><cell>9</cell></row><row><cell>12</cell><cell>COMPLETE SHORT</cell><cell>9</cell></row><row><cell>13</cell><cell>PURE ORAL</cell><cell>9</cell></row><row><cell>14</cell><cell>COMPLETE FINANCIAL</cell><cell>8</cell></row><row><cell>15</cell><cell>COMPLETE HUMAN</cell><cell>8</cell></row><row><cell>16</cell><cell>COMPLETE MENTAL</cell><cell>8</cell></row><row><cell>17</cell><cell>PURE ECONOMIC</cell><cell>8</cell></row><row><cell>18</cell><cell>ABSOLUTE MINIMUM</cell><cell>7</cell></row><row><cell>19</cell><cell>ABSOLUTE MORAL</cell><cell>7</cell></row><row><cell>20</cell><cell>COMPLETE MONETARY</cell><cell>7</cell></row></table><note>REGISTER: SPOKEN, FICTION, NEWS, ACADEMIC, NONFIC MISC, OTHER MISC. TOKENS: 23, 31, 30, 144, 139, 217. Table 9: The frequency of "absolute" by registers, from (view.byu.edu). The adjective absolute is used mainly in the academic register, with a frequency of 59.9 per million words. The adjective pure is also used in the fiction register, with a frequency of 27.4 per million words, and in the other registers, with a frequency of 24.9 per million words. The adjective sheer is used mainly in fiction (422 times, i.e. 26.1 per million words) and also in the news register (20.5 per million words). The adjective utter is used mainly in the fiction register (only 12.1 per million words); its use in the other registers is less important. The adjective complete is less used when compared with the other adjectives: it reaches a frequency of only 9.3 per million words.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 10 :</head><label>10</label><figDesc>Adjectives with their top 20 most frequent collocations, from (view.byu.edu).</figDesc><table /></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Corpus Linguistics. Investigating language structure and use</title>
		<author>
			<persName><forename type="first">D</forename><surname>Biber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Conrad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Reppen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>CUP</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Applied linguistics</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cook</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<publisher>OUP</publisher>
			<pubPlace>Oxford</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Firth</surname></persName>
		</author>
		<title level="m">Papers in Linguistics 1934-1951</title>
		<imprint>
			<publisher>OUP</publisher>
			<pubPlace>Oxford</pubPlace>
			<date type="published" when="1957">1957</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Frequency analysis of English usage</title>
		<author>
			<persName><forename type="first">W</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kucera</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1982">1982</date>
			<publisher>Houghton Mifflin</publisher>
			<pubPlace>Boston, MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Corpora in applied linguistics</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hunston</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Frequency Analysis of English vocabulary and grammar: based on the LOB corpus</title>
		<author>
			<persName><forename type="first">S</forename><surname>Johansson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hofland</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1989">1989</date>
			<publisher>Clarendon Press</publisher>
			<pubPlace>Oxford</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Patterns and meanings: Using corpora for English language research and teaching</title>
		<author>
			<persName><forename type="first">Alan</forename><surname>Partington</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>John Benjamins Publishing Company</publisher>
			<pubPlace>Amsterdam</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Corpus, concordance, collocation</title>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">McH</forename><surname>Sinclair</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1991">1991</date>
			<publisher>Oxford University Press</publisher>
			<pubPlace>Oxford</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">view.byu.edu</title>
		<editor>Mark Davies, Professor of Corpus Linguistics</editor>
		<imprint/>
		<respStmt>
			<orgName>Brigham Young University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">LOB.txt</title>
		<imprint/>
		<respStmt>
			<orgName>Web Concordancer</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
