English for Technology research through a corpus

Read the following paper presented by Olga Moudraia, called The Student Engineering Corpus: Analysing word frequency at Lancaster University, UK, about English for Technology, and answer the following questions:

– Have a look at the frequency lists shown in the article and think about the sort of vocabulary that could be found in a technical text.

– Look at the comparison of the fifty most frequent word forms in the Student Engineering Corpus, the COBUILD Bank of English Corpus and the BNC Written, and post your own conclusions here in this blog.

– Read figure 9 (Comparison between the Student Engineering semantic frequency list and BNC IT) and try to draw your own conclusions about the different semantic fields that arise when dealing with technological matters.

8 thoughts on “English for Technology research through a corpus”

  1. 1. The most frequent word forms in the Student Engineering Corpus are mainly function words, as in general texts (the, of, a, and, is, in, to, for, are, ...).
    2. The content word forms in the Student Engineering Corpus are predominantly from the scientific register (use, force, form, flow, pressure, show, determine, ...) while the most frequent content word forms in COBUILD and BNC Written are of a general nature.
    3. The most frequently encountered words in the Student Engineering Corpus appear to be subtechnical words with non-technical as well as technical senses. They are very common in most technical writing; I reached the same conclusion in my Final Grade Assessment, which dealt with the analysis of medical-legal terms.
    4. The ability to mobilize the resources of general English in the solving of technical problems is therefore more useful than technical vocabulary.
    5. Materials properties, condition or state, description of processes, time and location, measurements, causation, analysis of results and comparisons are for me the main semantic fields of this study and the most common in general technological matters.

  2. I agree with what Eduardo said. I would just like to comment on the last point.

    I think that regardless of the language used to transmit a message or new idea, there are several conventions apart from the language used that also have to be observed. In my opinion this is what is reflected in figure 9. All these metalinguistic items make the understanding of the message much easier for people working in the same field, even though they have a different mother tongue. This is also the reason why it is easier for these professionals to convey their ideas than is the case in general usage. This also means that conforming to these conventions is even more important than using language correctly if you want to get your message across.

    Understanding this kind of information will be of great use to linguists whenever they have to deliver an EST course, since it is knowledge they lack, not being professionals in these fields.

  3. Extra activity 3.1: Read the article by Olga Moudraia called The Student Engineering Corpus: Analysing word frequency at Lancaster University, UK, about English for Technology.

    A) Try to add definitions for: Keyword, concordance, frequency word list, etc.
    – concordances: collocates that are used to differentiate between parts of speech (e.g. use as a noun and as a verb), homonyms (e.g. light as opposed to heavy / light as opposed to dark), and different senses of the same word (e.g. impress meaning a) press hard into a soft surface leaving a mark or b) have a favourable effect on somebody).
    – frequency word lists: lists of all the words or word-clusters in the corpus, set out in alphabetical and frequency order. J. Nattinger suggests that clustering words helps students remember the general meaning of a word. Carter and McCarthy note that ‘knowing a word’ means knowing its underlying forms and derivations. So word lists can be useful in a number of ways: course designers may refer to them when considering the vocabulary component of a language course; teachers may use them to judge whether a particular word deserves attention or not, and whether a text is suitable for a class; English instructors may use them when selecting the vocabulary component of an ESP course, deciding whether a particular text is useful for a class, or creating the foundation of a lexical syllabus; and EFL/ESP learners may use them as a checklist or even as a goal in order to broaden their lexical base and, more importantly, raise their awareness of the lexical nature of language: which word forms of a particular word are used more frequently than the others, which part of speech represented by a particular word form is encountered more commonly (e.g. use as a noun or use as a verb), and even how different spellings of the same word compare (e.g. usable and useable or reuse and re-use).
    – headwords: the base word or the most frequent word in the family. Nation points out that grouping words under headwords is an attempt to increase the coverage of high-frequency vocabulary; the implication is that learning a word involves learning its derived and inflected forms as well.
    – keywords: they are not the most frequent words but the words which are most unusually frequent in a given body of text against the reference corpus (a large corpus of text that acts as a norm).
    – key keywords: words that are key in a number of files in the database.
    – semantic tagging: a form of corpus annotation that assigns semantic field tags to the words in a corpus.
    – sub-technical words: words with non-technical as well as technical senses, common in most kinds of technical writing.
    – word family: includes derived and inflected forms as well as compound words. Thus, the word entry “use” lists not only use, uses, using, used but also useful, usable, user, reuse, unused, misuse, abuse, multiuse and their derivatives, giving details on the ‘sub-families’ within a family, e.g. misuse within use.
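As a rough illustration of two of the definitions above, the following Python sketch builds a frequency word list (in frequency and in alphabetical order) and then sums the counts of the forms grouped under one headword. It is only a minimal sketch: the sample sentence, the `USE_FAMILY` mapping and the function names are hypothetical and are not taken from Moudraia's paper or from any particular corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word forms (hyphenated forms stay whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def frequency_list(tokens):
    """Return (word form, count) pairs in frequency order; ties are alphabetical."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def family_counts(tokens, families):
    """Sum the counts of every word form listed under each headword."""
    counts = Counter(tokens)
    return {head: sum(counts[form] for form in forms)
            for head, forms in families.items()}

# Hypothetical sample text and a hand-made word family (illustration only).
sample = ("The pump is used to force the flow. Using a larger pump, "
          "the user can reuse the flow and use less force.")
USE_FAMILY = {"use": ["use", "uses", "using", "used", "user", "reuse", "re-use"]}

tokens = tokenize(sample)
freq = frequency_list(tokens)   # frequency order
alpha = sorted(freq)            # the same entries in alphabetical order
use_total = family_counts(tokens, USE_FAMILY)["use"]

for form, count in freq[:4]:
    print(form, count)
print("family 'use':", use_total)
```

In this toy text no single form of “use” occurs more than once, yet the family as a whole outranks every content word, which is the effect that grouping forms under a headword is meant to capture.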

    B) Have a look at the frequency lists shown in the article and think about the sort of vocabulary that could be found in a technical text. Look at the comparison of the fifty most frequent word forms in the Student Engineering Corpus, the COBUILD Bank of English Corpus and the BNC Written, and comment on your own conclusions.
    – the most frequent word family in engineering textbooks found in the Student Engineering Corpus is “use”
    – the most frequent word forms in all three corpora (the Student Engineering Corpus, the COBUILD Bank of English Corpus and the BNC Written) – being mainly function words – coincide. These are: the, of, a, and, is, in, to and for.
    – the 50 most frequent content word forms in the Student Engineering Corpus are predominantly from the scientific register (figure, flow, determine, force, velocity, pressure) while the most frequent content word forms in COBUILD and BNC Written are of a general nature (is, was, be, have, are, had, has, will, were).
    – Words appear to be used more frequently in the Student Engineering Corpus in their general, non-technical sense (solution: the answer to a problem) than in their specialist sense (solution: a liquid).
    – the key verbs in the Student Engineering Corpus appear to be predominantly from the academic register and are as follows: be, show, determine, use, require, obtain, apply, assume, calculate, correspond to, define, give, act, illustrate, occur, become, consider, exert, indicate, locate, sketch, solve, and substitute. Considering this information, more attention in the ESP classrooms should be given to non-technical than technical vocabulary. This concurs with Hutchinson and Waters’ opinion that students of ESP require not a corpus of technical language but the ability to mobilize the resources of general English in the solving of technical problems, and that technical English is only a development or extended application of general English while a wide-ranging knowledge of everyday vocabulary and the ability to mobilize this knowledge in the interpretation of technical discourse are important aids to comprehension and memory.
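The comparison described in B) can be sketched in a few lines of Python: take the top of each ranked list, find the forms they share, and see which content words remain once function words are set aside. The two ranked lists below are toy lists assembled from words quoted in this comment, with an invented ordering, and `FUNCTION_WORDS` is a small hypothetical stoplist; none of it reproduces the actual tables in the article.

```python
# Small hypothetical stoplist; a real study would use a fuller list.
FUNCTION_WORDS = {"the", "of", "a", "and", "is", "in", "to", "for",
                  "are", "be", "was", "that", "it"}

def shared_forms(*ranked_lists, top_n=50):
    """Forms found in the top_n of every list, kept in the order of the first list."""
    common = set.intersection(*(set(lst[:top_n]) for lst in ranked_lists))
    return [w for w in ranked_lists[0][:top_n] if w in common]

def content_forms(ranked, top_n=50):
    """Forms in the top_n that are not in the function-word stoplist."""
    return [w for w in ranked[:top_n] if w not in FUNCTION_WORDS]

# Toy ranked lists (illustrative ordering, not the article's data).
engineering = ["the", "of", "a", "and", "is", "in", "to", "for",
               "figure", "flow", "force", "pressure"]
general = ["the", "of", "and", "to", "a", "in", "is", "for",
           "was", "that", "it", "be"]

shared = shared_forms(engineering, general)
eng_content = content_forms(engineering)
gen_content = content_forms(general)

print("shared:", shared)
print("engineering content words:", eng_content)
print("general content words:", gen_content)
```

What counts as a content word depends entirely on the stoplist chosen: here forms such as “was” and “be” are treated as function words, whereas the comment above lists them among the general content forms.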

    C) Read figure 9 (Comparison between the Student Engineering semantic frequency list and BNC IT) and try to draw your own conclusions about the different semantic fields dealing with technological matters.
    The content analysis of the Student Engineering Corpus (which represents English engineering lexis encountered in compulsory textbooks for engineering students) has been performed with a web front end called Wmatrix, which shows the significant concepts in the corpus and the words related to those concepts in frequency order. The Wmatrix semantic frequency list has a very useful option called ‘compare to normative BNC IT’ which compares the concept frequencies against a sub-corpus of the BNC that has been semantically tagged. The BNC IT corpus contains 135 files selected from the pure and applied science section of the BNC which are related to Information Technology (IT). In Wmatrix, the results are sorted on the LL (log-likelihood) field, which shows how significant the difference is. The items with a ‘+’ code signify overuse in the given text as compared to the Standard English corpora. To be statistically significant, the LL value should be over 6.63. Here, the 10 most significant semantic categories are: numbers; shape; measurement: weight; grammatical bin; mathematics; objects generally; substances and materials: liquid; substances and materials generally; measurement: length & height; and temperature. These results could provide the foundation for a lexical syllabus for an Engineering English course, which should be developed around these 10 semantic categories in order to train engineers in their Engineering English proficiency.
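To make the LL figure less abstract, here is a minimal Python sketch of the two-corpus log-likelihood statistic commonly used for this kind of keyness comparison. It assumes the usual formulation (observed frequency in each corpus against the frequency expected if the item were spread evenly across both), and the counts passed in at the end are hypothetical, not values from figure 9. The 6.63 threshold is the critical value for p < 0.01 with one degree of freedom.

```python
import math

LL_THRESHOLD = 6.63  # critical value for p < 0.01, 1 degree of freedom

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood of an item's frequency in a study corpus vs a reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the item were equally likely in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def usage_sign(freq_study, size_study, freq_ref, size_ref):
    """'+' for relative overuse in the study corpus, '-' for underuse."""
    return "+" if freq_study / size_study > freq_ref / size_ref else "-"

# Hypothetical counts: a semantic tag seen 100 times in 10,000 tagged words
# of the study corpus and 10 times in 10,000 words of the reference corpus.
ll = log_likelihood(100, 10_000, 10, 10_000)
sign = usage_sign(100, 10_000, 10, 10_000)
print(f"LL = {ll:.2f} ({sign}), significant: {ll > LL_THRESHOLD}")
```

Because LL grows with the raw counts as well as with the difference in proportions, it is a measure of how reliable the difference is and not of how many times more frequent a category is.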

  4. Moudraia’s research aims at the compilation of an engineering vade mecum: a ready reference of the most useful, if not the most commonplace, terms in the field of applied science. Her initial observation (p. 552) places small corpora at the centre of linguistic research, even though she will later acknowledge that such corpora could not even have existed in the first place without the in-depth analysis of very large amounts of data provided by the long-established multi-million token corpora: ‘[…] smaller corpora are designed to represent the specific part of the language under investigation and are tailored to address the aspects of the language relevant to the needs of the learner’.

    Needs analysis and a serious consideration of the effect of a well-designed study programme on all stakeholders are fundamental in LSP teaching and learning, and, in view of the profusion of research papers being published, it is a key aspect of ESP/EAP. In Moudraia’s view, a future engineer needs to become familiar with a very particular basic terminology, as well as the most common uses -and abuses- of the engineering jargon.

    If we analyse the comparative frequency wordlists presented in her document (pp. 556-558), we immediately realise that the 25 most frequent grammatical word forms only differ slightly between the large corpora -BNC Written and COBUILD- and Moudraia’s 2,000,000 token Student Engineering Corpus. Dissimilarities can be observed from the 25th word form onwards, where surprisingly infrequent words start cropping up.

    Content words, however, are a different matter altogether. Word forms such as ‘equation’ (18), ‘section’ (26) or ‘axis’ (47), are in the top-50 most frequent terms of the Engineering Corpus, while they would not be likely to come up in the long-established corpora even if we had access to the first 3,000+ tokens.

    Figure 9 (pp. 560-561) pays attention to the divergent characters of small corpora with a specialized focus -in this case the aforementioned Student Engineering Corpus- and the very large corpora that have been in existence for decades. Those semantic fields that the average reader would automatically associate with the world of science seem to be up to 13,000 times more frequent in a small specialized corpus. Here it must be noted that so exaggerated a result can only be due to the relatively small number of tokens Moudraia’s corpus handles. The author herself says (p. 552): ‘[…] a corpus with less than a million words will be considered small now’. Hers is only twice the threshold size, hence the peak results in figure 9.

    In the field of the Humanities, the question still remains: should second language learners be exposed to the 100 most frequent words (both in the grammar and the content senses) at the very beginning of their education, regardless of the traditional consideration of these items as ‘difficult words’? Indeed, ‘which’ is a very frequent word according to the BNC, but it would take quite a long time to get acquainted with it if we were to follow an English language learning method.

  5. 1.- Having a look at the frequency lists shown in the article, we can see that, as in any other Language for Specific Purposes, we can find technical, subtechnical and general vocabulary.

    2.- Looking at the comparison of the fifty most frequent word forms in the Student Engineering Corpus, the COBUILD Bank of English Corpus and the BNC Written, we can draw the following conclusions:
    a- The 50 most frequent words in all three are “function” words, and they practically coincide.
    b- Whereas the 50 most frequent “content” words in the SEC belong to the scientific register, their counterparts in the COBUILD and the BNC Written belong to general vocabulary. Moreover, most of these technical words in the SEC are of the semi-technical kind, and they are used more frequently in their non-technical sense than in their technical one, which permits us to conclude that more attention should be given to non-technical than to technical vocabulary and that technical English is only a development or extended application of general English.

    3.- Regarding figure 9 (Comparison between the Student Engineering semantic frequency list and BNC IT), we can draw the following conclusions about the different semantic fields dealing with technological matters:

    The content analysis of the SEC is performed using a web front end called Wmatrix. Wmatrix allows us to analyze the frequency order of the semantic fields in the SEC and to compare it with their frequency order in the normative BNC IT, allotting a coefficient to every semantic field in the SEC in relation to its frequency in the BNC. That way we can see to what extent a given semantic field is used more in the SEC than in the BNC.

    The ten semantic fields with the highest coefficient of overuse in relation to their frequency in the BNC are: numbers, shape, weight, grammatical bin, mathematics, objects generally, liquids, substances and materials generally, length and height, and temperature.

    Knowing that these are the most frequent semantic fields in technological English in relation to general English may help us to lay out a proper syllabus to train engineers to be proficient in them.

  6. The most frequent word forms in the three of them are different forms of the verb to be; in fact the first three positions (except in the BNC) are occupied by this verb, which reveals the importance of the concept of existence both in the general language and in the technical world. It is in the engineering corpus that it is most used (2.43% versus 0.93% and 0.99% in the other corpora).
    The general corpora (COBUILD and BNC) list only auxiliary and modal verbs in the first 15 positions (only the noun “one” appears within this section, in positions 9 and 12 respectively). On the other hand, the Student Engineering Corpus offers only three modals and the noun “two” in the same section of the list; the most significant feature here is the appearance of 6 general words, nouns and verbs, used in specific contexts (figure, flow, determine, force, velocity, pressure). Thanks to this we realize that:
    1. General words used in specific contexts are very important in scientific writings.
    2. Modality is unimportant in scientific texts, where accuracy is essential. In fact, scientific texts show facts as they are without exhibiting the subjectivity of their author.
    If we continue looking down to the bottom of the list, we can see that the general corpora continue listing adjectives, adverbs and nouns among some other auxiliaries and modals, while the Student Engineering Corpus only offers two auxiliaries and one modal (must) in the remaining 35 positions. No adjectives or verbs appear, only nouns and the adverb thus. This confirms that scientific texts are more interested in objective descriptions than in opinions.
    Pilar
