
ASSESSING ACADEMIC LITERACY OF FIRST YEAR VIETNAMESE UNIVERSITY STUDENTS: HOW APPROPRIATE IS THE TALL?

LOAN LE
S2048779
MA thesis

Department of Applied Linguistics
Faculty of Arts
University of Groningen

Supervisors: Dr. Hilde Hacquebord and Prof. Albert Weideman


ACKNOWLEDGEMENT

I would like to express my sincere thanks to the people who have supported me enormously in completing this thesis. First and foremost, I would like to thank Dr. Hilde Hacquebord for her useful instructions and orientation, which helped to keep me on the right track regarding the scope of this study, and for her faith in the feasibility of this thesis, which inspired me to undertake it.

Secondly, I wish to thank Prof. Albert Weideman, University of the Free State, South Africa. He has been impressively supportive and patient since the very beginning, when I first talked to him about my interest in the testing of academic literacy and in the TALL, of which he is an author. Without his help with the statistical analyses, his explanations of the relevant terms, and his valuable corrections to certain chapters, I would not have been able to finalize this thesis.

Moreover, I would like to thank Mr. Mik Van Es for his help with the psychometric analyses. I would also like to thank the NFP programme for funding my trip to Vietnam to conduct the pilot test and collect data for the thesis.

Additionally, I would like to thank my dear friend and sister, Lien, for spending a couple of days helping me document the raw scores of the pilot test takers.


ABSTRACT


TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS

CHAPTER ONE: INTRODUCTION

CHAPTER TWO: BACKGROUND
2.1 Introduction
2.2 What is academic literacy?
2.3 Academic literacy testing
2.4 What are the qualities of language tests?
2.5 Academic literacy and reading comprehension skill
2.6 Research questions

CHAPTER THREE: RESEARCH METHODOLOGY
3.1 Introduction
3.2 Participants
3.3 Pilot testing materials
3.3.1 Construct of TALL
3.3.2 The personal profile questionnaire
3.4 Procedures
3.4.1 Purpose of the pilot
3.4.2 Administration of the test
3.5 Data analysis
3.5.1 Test and item analysis (the Iteman program)
3.5.2 Factor analysis (the Tiaplus program)

CHAPTER FOUR: RESULTS AND DISCUSSION
4.1 Introduction
4.2 Participants' personal profile
4.3 Pilot test results and interpretation
4.3.1 Overview of test scores of CFL and UP students
4.3.2 Comparison of TALL scores of students of CFL and UP
4.3.3 Results from the Tiaplus program for the TALL scores of CFL and UP students
4.3.4 Dimensionality
4.3.5 Item discrimination and alpha if items removed via the Tiaplus program
4.3.6 Item discrimination index and facility values from the Iteman program (attached in the appendix)
4.3.7 Correlations between sections of the test from test and item analysis via the Tiaplus program
4.4 Discussion
4.4.1 Comparison of scores on TALL by students of CFL and UP
4.4.2 Appropriateness of TALL to assess academic literacy of CFL students
4.5 Recommendations

CHAPTER FIVE: CONCLUSION


CHAPTER ONE: INTRODUCTION

Concern for the academic success of Vietnamese undergraduates has gained increasing attention in recent years, since the credit-based system, considered an innovative training system, was introduced at certain universities in the country. This concern has grown since researchers identified several challenges in administering the system in undergraduate education at the University of Danang. So far, a number of teachers and authorities within the university have suggested specific measures to improve the effectiveness of the credit-based system (Dung, 2006; Cuong, 2006; Son, 2010).


losses for the students (taking the courses again and again), their parents (having to pay extra tuition), the education system and for society as a whole (Weideman, 2003, p.56).

It is not sufficient to test language students only on basic linguistic components, such as grammar, vocabulary, morphology, lexicology or phonology, through the four basic language skills (reading, speaking, writing and listening), and to take this as an indication of whether they may succeed at a college of foreign languages. Success in tertiary education for foreign language students requires, beyond the four basic language skills, a mastery of a combination of abilities such as critical thinking, analytical skills, problem-solving skills and chart/graph-interpreting skills. All of these skills help determine students' academic performance. In addition, research has shown that academic language proficiency pertains to academic success at college (Weideman, 2003, p.56). Accordingly, academic language proficiency plays a prerequisite and critical role in success in tertiary education (Van Rensburg and Weideman, 2002), especially for students majoring in foreign languages. That is why it is important to assess new undergraduates' academic ability, to help them establish whether they are prepared for college education. However, no such assessment is currently available at the Danang College of Foreign Languages (CFL).

In the meantime, the tests of academic literacy to be utilized in this study have been used as placement and assessment tests by four universities in South Africa since 2003 to assess students' academic literacy levels. The purpose of the tests is to identify students who are at risk of not completing their courses on time due to low academic literacy levels. These students are then placed in support courses for academic literacy enhancement to facilitate their academic success (Weideman, 2003).


CHAPTER TWO: BACKGROUND

2.1 Introduction

This chapter elaborates on conceptions of academic literacy, and on the relationship between academic literacy, foreign language skills (reading and writing in particular), and the academic performance of undergraduates.

2.2 What is academic literacy?

Academic literacy (AL) has become a complicated topic to define and discuss in recent decades. This study will review AL and AL testing from a number of different perspectives.

In the 1990s, academic literacy was seen as "a compound of linguistic, conceptual and epistemological rules and norms of the academe" (Van Schalkwyk, 2008, p.22). According to Blanton (1994), becoming academically literate occurs when a person who is proficient in academic speaking and writing can speak and write academically with authority. Ballard and Clanchy (1988), for their part, claim that AL should be addressed in a broader theory. Since language and culture cannot be studied separately, they view AL as a specific kind of functional literacy, defined as a learner's ability to use written language to perform cultural functions. They articulate the following criteria for someone to be academically literate: knowing the rules of giving and supporting arguments, and having the ability to use appropriate academic language genres at university level. For example, a student with analytical reasoning skills who can present and support their arguments with clear evidence in an academic style would be considered academically literate. Ballard and Clanchy's (1988) views of course relate to general ideas of functional literacy, which Venezky (1990) defines as "a general designation of abilities above basic literacy, allowing some level of functioning through print in society" (p.11). He categorizes literacy into three levels: basic literacy, required literacy, and functional literacy. (In particular, basic literacy is defined as "the level that allows self-sustained development in literacy", while required literacy is "the literacy level required for any given social context and which might, therefore, change over time, place and social condition".)


wider range of skills, for example: competence in writing, reading, critical analysis, independent learning, good judgment, and a sense of personal identity. Kern (2000) thus aptly notes that literacy "is an elastic concept; its meaning varies according to the disciplinary lens through which one examines it" (p.23) and so proposes a conceptual framework to study AL in three dimensions: the linguistic, the cognitive, and the sociocultural. Kern observes that it is impossible to understand AL without a comprehensive examination of all three perspectives. Similarly, Gilliver-Brown and Johnson (2009) claim that the development of AL is a long-term process involving the practice and refinement of skills and knowledge.

In alignment with Kern (2000), Newman (2002) also views AL from a sociocultural perspective. He sees AL as a social phenomenon rather than a purely cognitive process: it involves the motivation to make a decision, and the effect of that decision, within a set of social norms that determines the behaviour of members of the same community. Learners can thus be motivated to display their AL willingly in their community, e.g. in the classroom or in learning groups. For example, cultural background or social norms can facilitate or inhibit a learner's decision whether or not to participate actively in a group discussion.

Echevarria et al. (2004) define AL with reference to three knowledge bases: L2 language knowledge, topical knowledge, and knowledge of how to complete given tasks. Here the importance of language competence is considered. This means a learner has to possess a certain L2 level to be able to grasp academic texts (in this case, instructions for the tasks), and to use the L2 appropriately in certain contexts, based on their interpretation of the instructions for the given tasks. In addition, literacy in learners' L1 may affect their acquisition of L2 academic literacy (Echevarria et al., 2004). In this sense, being academically literate is more than possessing language skills, since L2 acquisition is a complex process of cognitive development. However, the definition of level of competence is not entirely clear in their exposition.


by students’ literacies in contexts outside school; and AL is influenced by students’ personal, social and cultural experiences” (p.12).

More specifically, Weideman (2003) articulates ten components for an individual student’s academic literacy. Accordingly, an academically literate student would be able to:

1. understand a range of academic vocabulary in context;

2. interpret and use metaphor and idiom, and perceive connotation, word play and ambiguity;

3. understand relations between different parts of a text, be aware of the logical development of (an academic) text, via introductions to conclusions, and know how to use language that serves to make the different parts of a text hang together;

4. interpret different kinds of text type (genre), and show sensitivity for the meaning that they convey, and the audience that they are aimed at;

5. interpret, use and produce information presented in graphic or visual format;

6. make distinctions between essential and non-essential information, fact and opinion, propositions and arguments;

7. distinguish between cause and effect, classify, categorise and handle data that make comparisons;

8. see sequence and order, do simple numerical estimations and computations that are relevant to academic information, that allow comparisons to be made, and can be applied for the purposes of an argument;

9. know what counts as evidence for an argument, extrapolate from information by making inferences, and apply the information or its implications to other cases than the one at hand;

10. understand the communicative function of various ways of expression in academic language (such as defining, providing examples, arguing); and make meaning (e.g. of an academic text) beyond the level of the sentence.


The above definitions and views on AL are therefore very much related to academic language proficiency, which is exactly what should be tested and assessed if one wishes to determine, eventually, how that ability relates to subsequent academic performance. I turn to a discussion of this in the next section.

2.3 Academic literacy testing

Bachman and Palmer (1996) conclude that, for the purpose of language testing, language ability should be considered within the framework of language use. In fact, during the second half of the 20th century, language proficiency was viewed as language ability described only with respect to the four skills (reading, listening, writing and speaking) and various components (e.g. grammar, pronunciation and vocabulary). The testing and assessment of language ability was therefore accomplished through the channels (auditory, visual) and modes (productive, receptive) of these skills. However, according to Bachman and Palmer (1996), "language use is not simply a general phenomenon that takes place in a vacuum". For example, when people read, they are not merely reading: they read for a purpose, say to find information on how to write a good résumé in the target language, in a certain setting, i.e. applying for a job that requires the use of that language. That is to say, language use is realized in the performance of certain specifically contextualized language use tasks.

On various early views of AL, academic literacy should be assessed as language proficiency, or as language skills plus critical thinking. However, Van Dyk and Weideman (2004a) argue that academic literacy should be assessed with "an enriched, open view of language and academic language ability". A skills-based definition of AL is seen as a potentially limited, restrictive view of language.


into intervention courses to promote chances for success in tertiary education. In 2004, however, the ELSA PLUS was replaced with the Test of Academic Literacy Levels (TALL), designed on the basis of the interpretation of AL by the main test designers, Van Dyk and Weideman (Van Dyk and Weideman, 2004a), that was outlined in the previous section. TALL is owned today by a consortium of four South African universities (Pretoria, Free State, Stellenbosch and North-West), and is widely used. Some 32,000 students wrote TALL and its Afrikaans counterpart, the Toets van Akademiese Geletterdheidsvlakke (TAG), in 2011.

TALL is now used for assessment and placement purposes at various universities in Africa (i.e. the four mentioned above, as well as the University of Namibia) (Van der Slik and Weideman, 2007). The test is used to assess the AL of first-year undergraduates. In this understanding, the assessment of AL is the assessment of these students' ability to use academic language appropriately in specific academic situations and communities, and of their level of academic discourse competence in doing so (Cliff and Yeld, 2006; Van Dyk and Weideman, 2004a).

2.4 What are the qualities of language tests?

Regarding the desired qualities of a language test, Bachman and Palmer (1996) state that the most critical quality of such an assessment instrument is its usefulness, since this quality enables the evaluation of a test in all aspects of its development and use (p.17). Their definition of test usefulness can be summarized in the following figure:

Usefulness = Reliability + Construct validity + Authenticity + Interactiveness + Impact + Practicality

(Adapted from Bachman and Palmer, 1996, p.18)


The first of these, according to Bachman and Palmer, is reliability. This quality can be defined as the consistency of measurement in test-takers' scores under different testing conditions. For example, if the same individual, say a high school student, takes the same test on different occasions, her scores should be consistent across administrations. Or suppose a school wants to use two test formats interchangeably to test the same group of students: there should then be no difference in an individual's performance on either format. Another example could be a set of language tests used to classify the levels of the same group of learners: if the scores from the same test formats do not rank the same individuals in the same order, the tests are not reliable. A particularly difficult type of reliability to achieve is inter-rater reliability. If, for example, the test answers constitute a stretch of academic writing, which should be rated consistently and on the same criteria by the same or different markers, this might require not only a substantial investment in drafting as clear a memorandum or marking protocol as possible, but also instituting sometimes elaborate and potentially costly processes of moderation, in order to retain a high level of consistency in and between raters, and across the work of different test-takers. Unreliability here means that different candidates may be treated unfairly. Reliability may therefore also be viewed as consistency, or as the absence of variation; it is variation that every test developer wants to avoid. Though it is impossible to eliminate variation entirely, a test developer strives to minimize those factors contributing to inconsistency that are under our control. In other words, while some factors cannot be controlled, we can control the characteristics of test tasks to improve the test's consistency. The effects of variation on test scores also need to be estimated, so that we know to what extent we have been successful in minimizing such variation in designing a test (Bachman and Palmer, 1996, p.20).
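To make the idea of consistency concrete, the following sketch (in Python) computes a simple test-retest coefficient, the Pearson correlation between two administrations of the same test. The scores are hypothetical, for illustration only; this is not how the reliability of TALL was estimated.

# A minimal sketch of reliability as consistency across administrations.
# The scores are hypothetical, not data from this study.
from statistics import correlation  # requires Python 3.10+

admin_1 = [62, 48, 71, 55, 80, 43, 66, 59]  # first administration
admin_2 = [60, 50, 73, 54, 78, 45, 64, 61]  # same students, second administration

# A Pearson r close to 1 means the two administrations rank the
# same individuals in (almost) the same order.
r = correlation(admin_1, admin_2)
print(f"Test-retest reliability (Pearson r): {r:.2f}")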


scores. Hence, when we interpret a test score, we will have to interpret it as an indicator of the ability to be measured in accordance with a certain domain of generalization. Construct validation is an ongoing process of gathering evidence to show that a specific interpretation of test scores is justified. Thus, to consider the construct validation of a test score interpretation, it is necessary to look at both the construct definition and the test task characteristics.

Authenticity, the third quality of test usefulness, also has to do with the characteristics of test tasks. Authenticity is defined as the correspondence of the characteristics of the test task to those of the target language use task. Authenticity is a critical quality of a language test, because test-takers' perceptions of the authenticity of test tasks might affect their performance. For example, if we want to develop a reading comprehension test, we may have to ensure that the topics chosen in the test are familiar to examinees in non-testing or real-life situations; thanks to this familiarity, test-takers can perform at their best. Bachman and Palmer (1996) state that authenticity relates to construct validity by providing a means to examine the extent to which interpretations of scores may be generalized, and studying the generalizability of score interpretations is part of construct validation. Since different test developers may take different approaches to designing tests, and different definitions of domains of the target language use may exist, the determination of the authenticity of test tasks can be approached from various perspectives. Accordingly, we speak of test tasks as "relatively more" or "relatively less" authentic, not merely as authentic or inauthentic (Bachman and Palmer, 1996, p.28).


The expected impact of the test on test-takers, test developers, educational systems and society as a whole is yet another quality of a language test foregrounded by these authors. Bachman and Palmer (1996, p.31) categorize this impact into two types: the micro level and the macro level. Impact at the micro level refers to effects on individuals who are directly affected by the test, i.e. test-takers, test developers, test users, teachers and decision-makers, or indirectly affected, i.e. future classmates or employers of test-takers. The macro level, meanwhile, refers to the education system or to society at large. Every test score will be used for certain purposes, and hence has certain consequences, for example to place students into suitable courses, to decide a pass or a fail, or to select the top candidate. When developing a test, test designers or test users should therefore consider carefully the test goals, values and specific settings, to be able to predict the consequences of test-taking and test scores for the education system, for society, and for the individuals involved. Bachman and Palmer (1996, p.35) suggest that in order to promote the positive impact of a test, test-takers should be engaged in the process of developing it, since their perceptions of the test tasks can affect their use of strategies to complete the test. In other words, because test task characteristics affect test-takers' perceptions and performance, their involvement in designing and developing a test may bring about test tasks that are more authentic and interactive.


Finally, practicality is a quality of a language test that should be attended to. Practicality refers to the relationship between the resources available and the resources required for the design, development, implementation and use of the test. Resources can be human resources, i.e. test designers, raters, coordinators, administrators and IT assistants; material resources, i.e. test rooms, equipment such as overhead projectors and computers, and materials such as paper and pencils; or time, to develop the test, to grade it and to document test scores. Practicality can be represented with the following formula:

Practicality = Available resources / Required resources

If practicality ≥ 1, the test development and use is practical. If practicality < 1, the test development and use is not practical.

(Adapted from Bachman and Palmer, 1996, p.36)

Where the required resources exceed the available resources, whether human or material, test designers should modify the test specifications, or look for ways to increase the available resources to meet the demand. Practicality also covers logistical constraints, for example having to use a test format that delivers results in a shorter space of time. At the South African universities referred to above, for instance, the administrative need for the results of AL tests to be available within 18-24 hours of being written, in order to place enrolling students on appropriate language support courses, precludes the use of elaborate and time-consuming hand-marking procedures, and forces the test designers to use a multiple-choice format as imaginatively as they can, to deliver results in time.
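As a trivial numerical illustration of the practicality ratio (the resource counts below are hypothetical):

# Hypothetical illustration of the practicality ratio: available resources
# divided by required resources, here counted in rater-hours.
available_rater_hours = 120
required_rater_hours = 150

practicality = available_rater_hours / required_rater_hours
print(f"practicality = {practicality:.2f}")  # 0.80 < 1: not practical as specified
# Remedies: modify the test specifications (e.g. switch to machine-scorable
# multiple-choice items) or increase the available resources.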

Since the above six qualities are interrelated and affect one another, it follows that it is necessary to find an optimal balance between reliability, construct validity, interactiveness, impact and authenticity with respect to the resources available to design and develop a test.

2.5 Academic literacy and reading comprehension skill


Currently, most models of reading comprehension view reading as a complex process comprising several sub-processes, i.e. syntactic parsing, lexical decoding, or activation of background knowledge, interacting with each other in two opposite directions: bottom-up and top-down. In the former, readers proceed from the bottom level to higher levels without intervention from information outside the text or from higher levels: for example, from letters to words, then from words to sentences, from sentences to paragraphs, and then to the text as a whole. In the latter, the process is reversed: readers use their background knowledge to interpret the text. In this sense, readers play an active part in top-down processing, which is thus seen as reader-based (Hulsker, 2002). Alderson (2000, p.18-19) states that sub-processes at higher levels can affect those at lower levels; for instance, readers can use background knowledge and context to interpret or guess the meaning of a word or a sentence. A number of studies have now shown that reading is neither exclusively bottom-up nor exclusively top-down, and models coordinating the two processes have as a result been proposed. Of these the most influential is Stanovich's (2000). His interactive model engages both bottom-up and top-down processing, with sub-processes interacting with each other depending on the reader's strengths and weaknesses. That explains why the interaction between these sub-processes varies from individual to individual.


L2/FL can influence their interpretations and perceptions of words or phrases, or of the author's implications, when the cultures are totally dissimilar. However, this is less of a problem when readers are from the same or similar cultures as the authors, or when the cultural settings of the text are congruent with their own (Urquhart and Weir, 1998, p.33).

According to Grabe (2000, p.226-231), L2 reading is an interactive process, since during reading, readers have to interact constantly with the text via their background knowledge and cultural background. This interactive process involves cognitive and metacognitive processes which can occur at various linguistic levels, such as the phonological, morphological, syntactic, semantic and discourse levels. Grabe (2002) also notes that good L2 readers are quick and efficient at recognizing words, an important factor which can be used to measure L2 reading comprehension skill.

In alignment with the above views, Nambiar (2005) observes that academic reading involves more than identifying main ideas in a passage; rather, it requires readers to coordinate their L2 reading with their own background knowledge so that they can achieve a certain level of understanding of the text. Alderson (2000) defines poor readers as those who do not know many reading strategies and do not know how and when to apply the knowledge they have, and good readers as those who do. Good readers can also apply these strategies successfully to solve the comprehension problems they recognize.


that reading proficiency and academic success are positively related. That is to say, effective strategies in L2 reading could help learners perform better on L2 academic literacy assessment that involves academic reading comprehension.

These studies have consequences, therefore, for the way in which we should assess academic literacy, especially the levels of competence of first-time, new entrants. They indicate that a test taking a broad view of AL, as outlined above, will be appropriate if it also focuses on reading ability; or, better, that a test is appropriate because it integrates the measurement of AL with the assessment of reading skill and strategy.

2.6 Research questions

The above definitions and discussion indicate that the conception of AL and its possible measurement with TALL are potentially aligned with and appropriate for the study of the AL levels of first-year students at the College of Foreign Languages (CFL), University of Danang, Vietnam. Hence, this study investigates the following research questions:

1. Are there any differences between the scores of students of the CFL and students of UP (University of Pretoria) on the Test of Academic Literacy Levels (TALL)? How big or significant are these differences?

2. If differences are found, how should we interpret them, and what does this tell us about the appropriateness of TALL for the CFL?

3. How can TALL be modified to be appropriate for the students of CFL in terms of assessing academic literacy?

Accordingly, in order to answer those research questions, the sub-questions are:

1. What are the descriptive statistics of TALL scores of students of the CFL and UP?

2. Are the scores of students of CFL comparable with those of students of UP?


CHAPTER THREE: RESEARCH METHODOLOGY

3.1 Introduction

This chapter presents descriptions of the test construct, the administration procedure, the participants, and the data analysis. The pilot testing is important as it can tell us about the reliability of the test and the productivity of each item in it. These terms are defined in detail in later sections.

3.2 Participants

The participants in this study (the Rubber pilot test) are 197 first-year students at the Department of English, College of Foreign Languages, University of Danang, Vietnam. English is a foreign language for them and also their major at this university. The class size is about 25 to 35 students per class. Most of these students have learnt English as a foreign language in Vietnam for seven years or more. There are usually more female students than male students, as in other English language classes in Vietnam; in this study, 178 of the 197 students (90%) are female. Gender differences will not, however, form part of this study. The participants were asked to declare their voluntary participation in the pilot test, and were given the choice of whether to provide their full name, in order to maximize the confidentiality of the test-takers' personal particulars.

Meanwhile, 1819 first-year students of the University of Pretoria took a similar test in September in South Africa. These students come from a variety of language backgrounds, ranging from English to African languages and Afrikaans, and from different socio-economic and academic backgrounds. The purpose of the test is to identify new undergraduates who are at risk on account of low academic literacy levels, and to place them into academic support courses. The test thus works as a placement test, not a pass-or-fail test, for these newly entered undergraduates.

3.3 Pilot testing materials

3.3.1 Construct of TALL


is based on the construct of academic literacy discussed earlier. The following subtests are adapted from those outlined in Van Dyk and Weideman (2004a).

Section 1: Scrambled text (there will be a scrambled paragraph that students have to rearrange to its original order).

Section 2: Interpreting graphs and visual information (this section examines students’ ability to interpret graphs, charts, diagrams, etc. and their capacity for quantitative literacy [numeracy] related to academic tasks).

Section 3: Text type. There are a number of sentences or phrases from various genres, which students have to match with sentences or phrases from the same text types.

Section 4: Academic vocabulary. Academic vocabulary is tested more closely in this section than in others.

Section 5: Understanding texts. This section normally consists of one or more extended reading passage(s), followed by questions that measure the ability to distinguish between essential and non-essential information, or cause and effect, as well as questions on inference, sequence, definition, metaphor and idiom, and so forth.

Section 6: Text editing. This section, based on a modified cloze format, normally has three sub-parts, though the same text is used. In the first sub-section, a word is omitted, and students have to indicate the place where it is missing. In the second, the place where the missing word has been taken out is indicated, and students have to choose the appropriate word to fill in. In the third and final part, students have to indicate both the place and missing word. In the Rubber test, these separate parts have been collapsed and students need to say which word has been omitted and where.


considered, since this section is not always scored by the test administrators, who have a choice of marking it only in the case of, for example, borderline cases. Since such borderline cases are now being identified predominantly by empirical means (cf. Van der Slik and Weideman 2005), this choice is being exercised less and less by administrators.

According to Van der Slik and Weideman (2005), the test is quite stable and reliable regarding its construct. Furthermore, a number of subsequent studies have contributed to its refinement, to improve its validation and standardization. The funding to continuously develop and refine the tests is currently provided by the partnering institutions which make agreements about this issue yearly.

In the pilot TALL (the RUBBER test), there are seven sections with 100 questions, all in multiple-choice format:

Section 1: Scrambled text (5 questions: 1-5)
Section 2: Vocabulary knowledge (10 questions: 6-15)
Section 3: Verbal reasoning (5 questions: 16-20)
Section 4: Interpreting graphs and visual information (10 questions: 21-30)
Section 5: Register and text type (5 questions: 31-35)
Section 6: Text comprehension (45 questions: 36-80)
Section 7: Grammar and text relations (20 questions: 81-100)

3.3.2 The personal profile questionnaire

There are three questions in the questionnaire section of the answer sheet. Their purpose is to collect additional information about the students' personal profile: gender, length of time spent learning English, and whether or not they have taken tests such as the TOEFL/IELTS/TOEIC.

3.4 Procedures

3.4.1 Purpose of the pilot


3.4.2 Administration of the test

Since the test consists of 100 questions, all of which are in multiple choice format, the students are allocated 90 minutes without any breaks to complete the test. The language of the test items is English.

According to Van Dyk (2006), since TALL is a standardized test, there is a compulsory administration procedure. He details three stages: before the test, during the test, and after the test. In the before-the-test stage, test-takers are informed about the procedures and rules of the test, and required facilities such as pencils, sharpeners, erasers, registration forms, question booklets and answer sheets are made available in the test room. The during-the-test stage consists of four phases, with specific steps for both invigilators and test-takers to follow. In phase 1, two invigilators provide students with pencils and test administration forms (in the pilot, a form stating that they willingly take part in the test and allow the researcher to use their results for research purposes); all examinees are then seated one row away from each other. Phase 2 begins with introducing the students to the aim of the test, the test construct, the time allocation, and the prohibition of cell phones in the test room. Phase 3 includes a demonstration of how to complete the registration forms (in this case, the consent statement), followed by the distribution of the question booklets and answer sheets; invigilators then give instructions on how to complete the required fields in the answer sheets, such as full name, test date, years of studying English, etc. The test commences with phase 4. One and a half hours are allocated for completion of the test, and invigilators inform the test-takers of the time every half hour. In the after-the-test stage, the final phase of the whole administration, invigilators collect both the question booklets and the answer sheets, and count and recount them (adapted from Van Dyk, 2006).

3.5 Data analysis


facility values and discrimination indices for each individual subtest item (Van der Slik and Weideman, 2005, p.24).

3.5.1 Test and item analysis (the Iteman program)

Since the second research question (the appropriateness of TALL for assessing the academic literacy of CFL students) refers to the reliability of the TALL, the process of validation for this pilot TALL includes an item analysis, performed with the Iteman program, which examines the contribution of each item to the test as a whole. With this analysis, we can determine how well each item distinguishes weak examinees from strong ones, as well as how easy or difficult a test item is. Moreover, setting specific parameters of acceptability helps indicate items that perform well or badly (Van der Slik and Weideman, 2005, p.24). According to Hughes (2003, p.225), in item analysis, items identified as inefficient are removed or modified. This process involves the calculation of facility values and a discrimination index, and an analysis of distracters in multiple-choice items. The facility value is defined as the percentage of correct answers across the whole test population: for example, if 100 examinees take a test and 45 give the correct answer to an item, its facility value is 0.45. In developing a placement test intended to place test-takers in courses at different levels, the facility values should show a wide range of values rather than big gaps between scores (Hughes, 2003, p.225).


should be taken into account is that test items with a low positive index do not necessarily have to be removed. Apart from the discrimination index, it is also a good idea to consider whether an item may have been too easy or too difficult, because the former motivates test-takers, while the latter can help to discriminate among the strongest test-takers. The size of the sample of test-takers is another matter to consider: if it is small, say 30 or so, the discrimination index might not be very meaningful (Hughes, 2003, p.226-228). So test developers should be very careful in deciding which items to throw out and which to keep when revising their test. In other words, items with a high discrimination index will be productive in the test, while items with a negative discrimination index will usually not be; and, more importantly, items that are not likely to measure within the desired parameters of acceptability may need to be discarded as well (Van Dyk and Weideman, 2004).
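To illustrate the two statistics discussed above, the sketch below computes facility values and a simple upper-lower discrimination index from a 0/1 response matrix. The data are randomly generated, and the computation is an illustration of the concepts, not the Iteman program's actual algorithm.

# Illustrative computation of facility values and an upper-lower
# discrimination index from a 0/1 response matrix (rows = examinees,
# columns = items). A sketch of the concepts, not Iteman's algorithm.
import random

random.seed(1)
n_examinees, n_items = 99, 5
responses = [[random.randint(0, 1) for _ in range(n_items)]
             for _ in range(n_examinees)]

totals = [sum(row) for row in responses]
order = sorted(range(n_examinees), key=lambda i: totals[i])
third = n_examinees // 3
lower, upper = order[:third], order[-third:]

for item in range(n_items):
    # Facility value: proportion of all examinees answering the item correctly.
    fv = sum(row[item] for row in responses) / n_examinees
    # Discrimination: proportion correct in the top third of scorers
    # minus the proportion correct in the bottom third.
    p_upper = sum(responses[i][item] for i in upper) / third
    p_lower = sum(responses[i][item] for i in lower) / third
    print(f"Item {item + 1}: FV = {fv:.2f}, D = {p_upper - p_lower:+.2f}")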

The hypotheses when running the Iteman program are: (1) The test is reliable and is appropriate for assessing the target groups of examinees; (2) The general ability of pilot test takers matches the general level of difficulty of the test items; (3) The item coefficients (discrimination indices) satisfy specific criteria (adapted from the validation model suggested by Van der Walt and Steyn, 2007). Accordingly, a subsidiary hypothesis will be: a factor analysis of the test will reveal an appropriate level of homogeneity or heterogeneity.

3.5.2 Factor analysis (the Tiaplus program)

A factor analysis was also used to investigate the possible existence of clusters of variables, with the help of the Tiaplus program. Factor analyses are a collection of statistical methods for studying the way underlying constructs influence responses on measured variables (DeCoster, 1998, p.1). Van der Slik and Weideman (2005, p.26) state that factor analyses are used to gain insight into the consistency of items in a test: to see whether they are homogeneous or one-dimensional in what they set out to measure, or whether, if they are heterogeneous, there is still an adequate measure of homogeneity. In tests with a wide-ranging construct, it is the true extent or level of homogeneity that is at issue.
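Tiaplus's own factor-analytic routine is not reproduced here; the sketch below instead uses scikit-learn's FactorAnalysis on simulated item scores to show the kind of clustering analysis meant. The two-factor structure of the simulated data is an assumption made purely for the demonstration.

# Generic factor-analysis sketch (not Tiaplus's routine): fit a two-factor
# model to simulated item scores and inspect the loadings, which show how
# items cluster on underlying dimensions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_examinees, n_items = 500, 8

# Simulate two latent abilities; items 0-3 load on the first factor,
# items 4-7 on the second.
ability = rng.normal(size=(n_examinees, 2))
loadings = np.zeros((2, n_items))
loadings[0, :4] = 0.8
loadings[1, 4:] = 0.8
scores = ability @ loadings + rng.normal(scale=0.5, size=(n_examinees, n_items))

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(scores)
# Items with similar loading patterns form clusters (dimensions).
print(np.round(fa.components_, 2))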


CHAPTER FOUR: RESULTS AND DISCUSSION

4.1 Introduction

This chapter has two parts. The first presents and analyzes the TALL scores of students of CFL and UP, and then turns to the validation of the TALL for students of CFL. This is carried out in three steps: (1) interpreting the personal profiles of the pilot test-takers; (2) analyzing the raw TALL scores of students of CFL and UP with test and item analyses and a factor analysis; (3) interpreting the collected data, illustrated with graphs and tables. The personal profiles were collected through a questionnaire section enclosed in the test answer sheets. The data analysis was done using statistical programs (SPSS, Iteman and Tiaplus) to study the reliability and construct validity of TALL for students of CFL. The second part of the chapter presents the discussion with respect to the research questions, the hypotheses and the statistical results generated by these programs.

The research questions relate to the appropriateness of the TALL for CFL students. In this paper, the term appropriateness of the TALL entails the study of the test's construct validity and its reliability. A high-stakes test such as TALL will be reliable if its alpha, or internal consistency, is higher than 0.7 (Van Dyk and Weideman, 2004b). The construct validity will be investigated via the correlations between the subtests. Van der Walt and Steyn (2008, p.196) claim that preferable correlation values lie between 0.15 and 0.5. If these values fall outside these parameters, e.g. 0.8 or 0.9, then the subtests probably measure the same concept, while correlation values lower than 0.15 may negatively affect the integrity of the test as a whole. In the case of TALL, a multidimensional test with dissimilar sub-scales or notions, we look for modest correlation values (from 0.15 to 0.5) to maintain its multidimensionality.
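The alpha criterion above can be made concrete with the standard formula for Cronbach's alpha: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below applies it to hypothetical 0/1 item data; it is not the Iteman or Tiaplus implementation.

# Cronbach's alpha from a scored item matrix:
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = examinees, columns = items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical demonstration data (one shared ability drives all items),
# not scores from this study.
rng = np.random.default_rng(42)
ability = rng.normal(size=(200, 1))
demo = ((ability + rng.normal(size=(200, 20))) > 0).astype(int)
print(f"alpha = {cronbach_alpha(demo):.2f}")  # > 0.7 meets the criterion above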

4.2 Participants’ Personal Profile


The questions are: (1) Are you male or female? (2) How long have you been studying English? (3) Have you ever taken the TOEFL/IELTS/TOEIC exams? (4) If yes, then please provide your score. The first three questions are in multiple-choice format. The responses to these questions are presented in the tables below:

Gender            Male    Female
No. of students   19      178
Percentage (%)    10      90

Table 1: Responses for question 1

Years             0-3 years   4-7 years   >8 years
No. of students   1           84          112
Percentage (%)    0.51        42.60       56.90

Table 2: Responses for question 2

Exams             TOEFL   IELTS   TOEIC   Not yet
No. of students   3       1       2       191
Percentage (%)    1.50    0.50    1.01    96.95

Table 3: Responses for question 3


4.3 PILOT TEST RESULTS AND INTERPRETATION

4.3.1 Overview of test scores of CFL and UP students

Table 4 presents the frequency distribution of the test scores of CFL students. One test-taker obtained the minimum score of 30, and likewise one obtained the maximum score of 83.

Score   Frequency   Score   Frequency
30      1           57      8
31      0           58      6
32      0           59      7
33      0           60      5
34      1           61      1
35      2           62      1
36      0           63      0
37      6           64      3
38      9           65      2
39      10          66      2
40      8           67      2
41      8           68      3
42      6           69      3
43      7           70      1
44      6           71      2
45      14          72      0
46      8           73      1
47      8           74      1
48      6           75      1
49      7           76      0
50      3           77      0
51      6           78      0
52      7           79      0
53      9           80      0
54      4           81      0
55      4           82      0
56      7           83      1

Table 4: Frequency distribution of CFL students' scores on TALL


Since there are a large number of different scores, a histogram of the above frequency distribution is used to get a general picture of the scores. Figures 1 and 2 are histograms presenting the distribution of CFL and UP students' scores on the pilot TALL.

Figure 1: Histogram of distribution of CFL students’ scores on TALL

Figure 2: Histogram of distribution of UP students’ scores on TALL


frequency histogram of UP students' scores on TALL is slightly right-skewed, with a mean of 55.25, a skew of 0.031 and a kurtosis of -0.026.

4.3.2 Comparison of TALL scores of students of CFL and UP

In order to answer the first research question (Are there any differences in the scores on TALL of students of CFL and students of UP?), we need to compare the two sets of scores more closely. We use the scale analysis generated by Iteman to indicate how the scores are distributed around the mean. Table 5 shows the scale analysis of the two sets of scores.

                                 CFL       UP
Number of items                  100       100
Number of examinees              197       1819
Mean                             49.69     55.25
Standard deviation               9.409     10.542
Variance                         88.528    111.133
Minimum                          30        20
Maximum                          83        89
Alpha                            0.774     0.831
Standard error of measurement    4.472     4.328
Skew                             0.690     0.031
Kurtosis                         0.037     -0.026

Table 5: Scale statistics for scores of students of CFL and UP on TALL

As can be seen in Table 5, the range of scores of UP students (69) is wider than that of CFL students (53). The range is the difference between the highest and lowest scores. In the two sets of scores, the maximum score of CFL students is only 83 while that of UP students is 89; yet the minimum score of CFL students is 30, whereas that of UP students is lower, at 20. This means all the scores of CFL students fall within a range of 53 (from 30 to 83), while all the scores of UP students fall within a range of 69 (from 20 to 89). Also, the scores of the CFL students (M = 49.69, SD = 9.409) are likely to be somewhat more homogeneous than those of the UP students (M = 55.25, SD = 10.542).
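The statistics in Table 5 can be reproduced for any list of scores with standard routines. The sketch below uses a short hypothetical score list together with the reported CFL alpha; note that the standard error of measurement follows from the reliability as SEM = SD * sqrt(1 - alpha), which matches the values in Table 5 (9.409 * sqrt(1 - 0.774) ≈ 4.472).

# Sketch of the scale statistics of Table 5, computed with scipy on a
# hypothetical score list (not the actual CFL or UP data).
import numpy as np
from scipy import stats

scores = np.array([30, 38, 39, 45, 45, 49, 52, 55, 57, 60, 64, 70, 83])
alpha = 0.774  # reliability as reported for the CFL scores

print(f"mean     = {scores.mean():.2f}")
print(f"SD       = {scores.std(ddof=1):.3f}")
print(f"variance = {scores.var(ddof=1):.3f}")
print(f"skew     = {stats.skew(scores):.3f}")      # > 0 means right-skewed
print(f"kurtosis = {stats.kurtosis(scores):.3f}")  # excess kurtosis
# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
print(f"SEM      = {scores.std(ddof=1) * np.sqrt(1 - alpha):.3f}")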


4.3.3 Results from the Tiaplus program for the TALL scores of CFL and UP students

The second research question, concerning the appropriateness of TALL for CFL students, entails a process of validating the TALL with regard to its reliability and construct validity. The output of the Tiaplus analysis is in Dutch; we therefore translated into English only the statistical terms necessary for the TALL validation process, as presented in the three tables below. (For more detailed results, see the Appendix.)

Number of persons in the test    197
Number of items                  100
Minimum test score               0
Maximum test score               100
Average test score (mean)        49.69
Standard deviation               9.41
Cut-off score                    59.5
Percentage failing               86.29
Coefficient alpha                0.77
Standard error of alpha          0.02
Average P-value                  50
Standard error of measurement    4.46

Table 6: Results from Tiaplus for CFL students' scores

Number of persons in the test    1819
Number of items                  100
Minimum test score               0
Maximum test score               100
Average test score (mean)        55.25
Standard deviation               10.54
Cut-off score                    59.5
Percentage failing               65.7
Coefficient alpha                0.8
Standard error of alpha          0.01
Average P-value                  56.65
Standard error of measurement    4.7

Table 7: Results from Tiaplus for UP students' scores

Number of persons in the test    2016
Number of items                  100
Minimum test score               0
Maximum test score               100
Average test score (mean)        53.73
Standard deviation               10.25
Greatest Lower Bound             0.91
Average Rit                      0.24
Cut-off score                    32.5
Percentage failing               1.39
Coefficient alpha                0.82
Standard error of alpha          0.01
Average P-value                  53.73
Standard error of measurement    4.31

Table 8: Results from Tiaplus for UP + CFL students' scores

As can be seen in Table 8, the total number of students taking the pilot test is 2016 (197 CFL students and 1819 UP students).

The overall coefficient alpha is 0.82, which is excellent for a pilot test. This alpha is Cronbach's alpha, the coefficient of reliability or internal consistency of a test. Cronbach's alpha could be higher still if certain items whose discrimination index is negative or too low were removed.


from Cronbach's alpha in that Cronbach's alpha presents the extent to which the observed scores depict the true scores (i.e. without measurement error), whereas the GLB will be higher than Cronbach's alpha if the construct of the test is multidimensional (Van der Slik and Weideman, 2005, p.26). Since academic literacy is not an entirely homogeneous construct, but one with a rich variety, a test of academic literacy should accommodate a suitably rich construct, not a homogeneous construct like those of some other ability tests (Van Dyk and Weideman, 2004a). In this pilot, the GLB is 0.91, which is higher than the coefficient alpha and satisfactory for a pilot test with multidimensionality.

The cut-off score, as shown in Table 8, is used to estimate misclassifications, i.e. the percentage of test-takers who fail the test when they should pass, and vice versa. The cut-off score is a point on the scale of test scores halfway between the highest failing score and the lowest passing score. (UP normally cuts at 0.15 to 0.29 standard deviations below the mean, for historical reasons.) For example, if the highest failing score is 23 and the lowest passing score is 24, the cut-off score will be 23.5. In this paper, the cut-off score for the two sets of scores is 32.5, which means that test-takers with scores below this point are regarded as failing.

The percentage failing is the percentage of examinees who do not reach the lowest passing score; in other words, who fail the test. In this pilot, 1.39% of the 2016 students who took the test failed.
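A minimal sketch of this cut-off logic, with hypothetical scores:

# The cut-off lies halfway between the highest failing score and the
# lowest passing score; the percentage failing is the share of examinees
# scoring below it. The scores here are hypothetical.
highest_failing, lowest_passing = 59, 60
cutoff = (highest_failing + lowest_passing) / 2  # 59.5, as in Tables 6 and 7

scores = [30, 45, 52, 58, 59, 61, 64, 70, 75, 83]
pct_failing = 100 * sum(s < cutoff for s in scores) / len(scores)
print(f"cut-off = {cutoff}, percentage failing = {pct_failing:.1f}%")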


positive, indicating a relation between performance on the items and performance on the test as a whole, which is satisfactory for a pilot test.

We should also examine the characteristics of the subtests and the test to gain a better understanding of how each subtest contributes to the reliability of the test and its construct validity. The following table is a summary of the statistical results of the total test and the subtests via Tiaplus program.

No. of examinees: 2016

Test section           Total test   1       2       3       4       5       6       7
No. of items           100          5       10      5       10      5       45      20
Average test score     53.73        1.63    7.56    2.61    6.26    2.88    21.06   11.72
Standard deviation     10.25        1.45    1.77    1.04    1.77    1.10    5.48    4.44
SEM                    4.31         0.81    1.25    0.99    1.25    0.77    2.95    1.81
Average Rit            0.24         0.76    0.44    0.47    0.43    0.66    0.27    0.50
Average P-value        53.73        32.60   75.64   52.24   62.61   57.58   46.81   58.59
Coefficient alpha      0.82         0.69    0.50    0.10    0.50    0.51    0.71    0.83
Greatest Lower Bound   0.91         0.77    0.56    0.17    0.57    0.62    0.80    0.93

Table 9: Summary of statistical results of CFL and UP students' scores on TALL

As Table 9 shows, the total number of examinees is 2016. The total number of items is 100, divided into 7 sections (detailed in Chapter 3, Construct of TALL) with 5, 10, 5, 10, 5, 45 and 20 questions respectively.

Regarding the subtests, the coefficient alpha and the GLB for subtest 3, Verbal reasoning, are so low (0.10 and 0.17 respectively) that they might indicate the worst-performing items within the test regarding reliability, while these values for the other subtests are fairly satisfactory. It is an early indication, therefore, that this whole subtest may either need to be omitted from a subsequent, refined version of the test, or be re-developed and re-piloted.

4.3.4 Dimensionality


Figure 3: Factor analysis of the pilot TALL

As shown in Figure 3, most items cluster in the upper left corner of the graph, yet the items in subtest 7, Grammar and text relations, are situated in the lower left corner and the middle right corner, which indicates that the test is multi-dimensional. According to Van der Slik and Weideman (2005, p.29), a possible explanation for the items in subtest 7 performing differently is that they may measure different aspects of academic literacy. This is much in line with the test authors' underlying perception of the construct of a test of academic literacy as a complex, varied and rich notion. Thus the construct of the TALL is probably varied, and not a single homogeneous concept. Van der Slik and Weideman (2005) provide a detailed explanation of the dimensionality of the construct of the TALL, and express satisfaction with the level of homogeneity evident from factor analyses such as the above. Incidentally, this factor analysis shares exactly the gull-wing shape found in other versions of TALL.

4.3.5 Item discrimination and alpha if items removed via the Tiaplus program

We then study more closely the correlation between the performance of the test as a whole and the performance on individual items with the help of the Tiaplus program.


the test. On the contrary, if this alpha goes higher when the item is removed, the item is not discriminating as well as the items for which the alpha remains the same. (Rit is explained in detail in the section above.) The tables below present the Rit and AR values of the subtests and of the test.

Item   Rit    AR     Item   Rit    AR     Item   Rit    AR     Item   Rit    AR
1      .35    .82    26     .7     .82    51     .16    .82    76     .22    .82
2      0      .82    27     .35    .82    52     .29    .82    77     .14    .82
3      .26    .82    28     .20    .82    53     .26    .82    78     0      .82
4      .33    .82    29     .22    .82    54     .32    .82    79     .10    .82
5      .27    .82    30     .28    .82    55     .20    .82    80     .14    .82
6      .13    .82    31     .27    .82    56     .21    .82    81     .31    .82
7      .21    .82    32     0      .82    57     .8     .82    82     .32    .82
8      .25    .82    33     .16    .82    58     .24    .82    83     .36    .82
9      .27    .82    34     .21    .82    59     .18    .82    84     .39    .82
10     .22    .82    35     .26    .82    60     .15    .82    85     .32    .82
11     .20    .82    36     .38    .82    61     .31    .82    86     .32    .82
12     .20    .82    37     .36    .82    62     .13    .82    87     .40    .82
13     .27    .82    38     .25    .82    63     .36    .82    88     .35    .82
14     .7     .83    39     .19    .82    64     .32    .82    89     .30    .82
15     .11    .82    40     .22    .82    65     .15    .82    90     .25    .82
16     .21    .82    41     .30    .82    66     .15    .82    91     .35    .82
17     .24    .82    42     .21    .82    67     .31    .82    92     .34    .82
18     -.2    .83    43     .20    .82    68     .31    .82    93     .39    .82
19     .27    .82    44     .18    .82    69     .25    .82    94     .36    .82
20     .14    .82    45     .19    .82    70     .19    .82    95     .37    .82
21     .12    .82    46     .27    .82    71     .22    .82    96     .36    .82
22     .19    .82    47     .36    .82    72     .19    .82    97     .33    .82
23     .27    .82    48     .22    .82    73     1      .82    98     .28    .82
24     .18    .82    49     .15    .82    74     .7     .82    99     1      .83
25     .10    .82    50     .30    .82    75     .31    .82    100    .2     .82

Table 10: Item discrimination and alpha if item removed for the whole test


While the AR for most of the items is .82, the AR value for items 14, 18 and 99 is .83. This indicates that these items do not discriminate between strong students and weaker ones as well as the other items do.

Item   Rit    AR
1      .71    .64
2      0      .74
3      .77    .59
4      .82    .54
5      .71    .64

Table 11: Item discrimination and alpha if item removed for section 1

Item   Rit    AR     Item   Rit    AR
6      .42    .47    11     .45    .48
7      .41    .49    12     .44    .46
8      .50    .45    13     .48    .46
9      .48    .45    14     .37    .52
10     .52    .44    15     .29    .53

Table 12: Item discrimination and alpha if item removed for section 2

Item   Rit    AR
16     .46    .8
17     .50    .4
18     .43    .18
19     .46    1
20     .50    .10

Table 13: Item discrimination and alpha if item removed for section 3

Item   Rit    AR     Item   Rit    AR
21     .38    .51    26     .26    .50
22     .37    .47    27     .59    .41
23     .45    .47    28     .50    .44
24     .39    .50    29     .47    .45
25     .31    .51    30     .55    .42

Table 14: Item discrimination and alpha if item removed for section 4

Item   Rit    AR
31     .66    .50
32     0      .55
33     .60    .43
34     .63    .41
35     .73    .36

Table 15: Item discrimination and alpha if item removed for section 5


Item   Rit    AR     Item   Rit    AR     Item   Rit    AR     Item   Rit    AR
36     .41    .70    47     .38    .70    58     .29    .70    69     .32    .70
37     .39    .70    48     .8     .70    59     .24    .71    70     .26    .71
38     .26    .71    49     .17    .71    60     .19    .71    71     .27    .70
39     .25    .71    50     .37    .70    61     .37    .70    72     .24    .71
40     .24    .71    51     .21    .71    62     .19    .71    73     .5     .71
41     .28    .70    52     .34    .70    63     .42    .70    74     .10    .71
42     .23    .71    53     .31    .70    64     .38    .70    75     .36    .70
43     .26    .71    54     .38    .70    65     .22    .71    76     .29    .70
44     .22    .71    55     .25    .71    66     .22    .71    77     .20    .71
45     .24    .71    56     .26    .71    67     .7     .70    78     0      .71
46     .31    .70    57     .13    .71    68     .38    .70    79     .12    .71
80     .18    .71

Table 16: Item discrimination and alpha if item removed for section 6

Item   Rit    AR     Item   Rit    AR     Item   Rit    AR     Item   Rit    AR
81     .53    .82    86     .50    .83    91     .52    .82    96     .60    .82
82     .55    .82    87     .60    .82    92     .52    .83    97     .55    .82
83     .57    .82    88     .46    .83    93     .57    .82    98     .39    .83
84     .58    .82    89     .42    .83    94     .58    .82    99     .14    .84
85     .50    .83    90     .39    .83    95     .64    .82    100    .14    .84

Table 17: Item discrimination and alpha if item removed for section 7

Tables 11 to 17 illustrate the Rit and AR values of items in the TALL in more detail. The AR values in sections 6 and 7 do not range as widely as those in the other sections, which indicates that these items are likely to have more discriminating power than others. Besides, the Rit values in these sections are fairly satisfactory (except for that of item 78), indicating that students with high total test scores tend to score better in these subtests than their fellow students whose total test scores are somewhat lower. Regarding the other subtests, items 2, 14, 15, 19, 21, 25 and 32 have higher AR values, which indicates that they do not discriminate among test-takers as well as the other items do.
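The AR statistic can be illustrated by recomputing Cronbach's alpha with each item left out in turn, as in the sketch below; the data are hypothetical and the code is not Tiaplus's implementation.

# Sketch of "alpha if item removed" (AR): recompute Cronbach's alpha with
# each item deleted in turn. An AR above the full-test alpha flags an item
# that weakens internal consistency.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(7)
ability = rng.normal(size=(200, 1))
data = ((ability + rng.normal(size=(200, 10))) > 0).astype(int)  # hypothetical

full_alpha = cronbach_alpha(data)
for i in range(data.shape[1]):
    ar = cronbach_alpha(np.delete(data, i, axis=1))
    flag = "weakens consistency" if ar > full_alpha else "contributes"
    print(f"item {i + 1:2d}: AR = {ar:.3f} (full alpha = {full_alpha:.3f}) {flag}")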

4.3.6 Item discrimination index and facility values generated by Iteman (attached in the Appendix)


items should be removed or adjusted, we should be careful when making decisions, given the small number of test-takers in the pilot. Apart from that, these values should be considered in the light of the TALL construct, i.e. the definitions of academic literacy and academic literacy testing should be taken into account. The table below shows the items with both low discrimination indices and low facility values.

No.  Section  Item  Disc.Index  Fac.Value      No.  Section  Item  Disc.Index  Fac.Value
1    1        3     .17         .09            30   6        38    .12         .01
2    2        5     .12         .07            31            42    .13         .13
3             6     .15         .14            32            43    .18         .14
4             7     .15         .09            33            44    .24         .17
5             8     .21         .13            34            48    .11         .10
6             9     .30         .15            35            50    .20         .16
7             10    .05         .04            36            56    .10         .05
8             11    .18         .13            37            59    .08         .06
9             13    .18         .16            38            60    .16         .10
10            14    .09         .04            39            62    .21         .13
11            15    .05         .02            40            64    .12         .09
12   3        16    .13         .03            41            65    .07         .01
13            17    .05         .04            42            66    .16         .01
14            18    .17         .05            43            71    .22         .17
15            19    .20         .15            44            72    .10         .14
16            20    -.00        -.06           45            73    .06         .05
17   4        21    .23         .16            46            74    .10         .05
18            22    .03         .03            47            75    .18         .10
19            23    .18         .16            48            77    .08         .08
20            25    .17         .12            49            79    -.04        -.07
21            26    .05         .01            50   7        81    .22         .13
22            27    .09         .06            51            87    .22         .13
23            28    .07         .03            52            88    .28         .13
24            29    .07         .09            53            89    .12         .06
25            30    .06         .01            54            95    .24         .11
26   5        31    .09         .08            55            98    .16         .12
27            32    .12         .08            56            99    .03         .03
28            33    .17         .15            57            100   -.08        -.14
29            34    .24         .14

Table 18: Items with low facility values and low discrimination indices (No.: number; Disc.Index: discrimination index; Fac.Value: facility value; Section 1: Scrambled text; Section 2: Vocabulary knowledge; Section 3: Verbal reasoning; Section 4: Interpreting graphs and visual information; Section 5: Register and text type; Section 6: Text comprehension; Section 7: Grammar and text)


4.3.7 Correlations between sections of the test from the Tiaplus program

Subtest                   Total test  1     2     3     4     5     6     7
Scrambled text       1    0.40
Academic vocabulary  2    0.43        0.15
Verbal reasoning     3    0.34        0.12  0.15
Graphic and visual   4    0.47        0.16  0.09  0.14
Register and text    5    0.35        0.15  0.15  0.10  0.14
Text comprehension   6    0.82        0.27  0.28  0.24  0.29  0.22
Grammar and text     7    0.63        0.08  0.10  0.07  0.16  0.10  0.24

Table 19: Intercorrelations between the subtests and the total test

According to Van der Walt and Steyn (2008, p.196), preferable subtest inter-correlation values lie between 0.15 and 0.5. If these values are too high (e.g. above 0.8), the subtests may all be measuring the same notion; if they are too low, the integrity of the test as a whole should be questioned. The correlation between each subtest and the total test, by contrast, should be higher (around 0.7 or more), as "the total score is taken to be a more general measure of the attribute than is each individual section score" (p.196). As Table 19 shows, the correlations among the subtests are largely satisfactory (from 0.15 to 0.5). The correlations are somewhat low, however, between subtests 1 and 3 (0.12); subtests 2 and 4 (0.09); subtests 2 and 7 (0.10); subtests 3 and 5 (0.10); subtests 3 and 7 (0.07); and subtests 5 and 7 (0.10).

As for the correlations between the subtests and the total test, subtests 6 and 7 relate strongly to the total test, with correlations of 0.82 and 0.63 respectively. The two subtests that relate most poorly to the total test are sections 3 and 5, with correlations of 0.34 and 0.35 respectively. This suggests removing or adjusting these two subtests to strengthen such correlations (Van der Walt and Steyn, 2008). It is a second indication that the Verbal reasoning subtest is not a desirable section. It is, incidentally, one that is not normally part of TALL, but which had been added here for experimental design reasons.
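Table 19 can be reproduced by summing item scores per subtest and correlating the resulting subtest scores with one another and with the total. A minimal sketch follows, using the item ranges taken from Tables 11 to 17; the function and variable names are illustrative, not part of the Tiaplus output.

```python
import numpy as np
import pandas as pd

# Item ranges per subtest, taken from Tables 11-17 (0-based slices)
SECTIONS = {
    "Scrambled text": slice(0, 5),
    "Academic vocabulary": slice(5, 15),
    "Verbal reasoning": slice(15, 20),
    "Graphic and visual": slice(20, 30),
    "Register and text": slice(30, 35),
    "Text comprehension": slice(35, 80),
    "Grammar and text": slice(80, 100),
}

def subtest_intercorrelations(scores: np.ndarray) -> pd.DataFrame:
    """Pearson correlations among subtest scores and the total score,
    computed from a persons x items matrix of 0/1 scores."""
    frame = pd.DataFrame(
        {name: scores[:, sl].sum(axis=1) for name, sl in SECTIONS.items()}
    )
    frame["Total test"] = scores.sum(axis=1)
    return frame.corr().round(2)
```

Note that in this computation each subtest's items are also part of the total, so longer subtests correlate with the total partly by construction, a point taken up in the discussion below.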

4.4. DISCUSSION

4.4.1 Comparison of scores on TALL by students of CFL and UP


The distribution of the CFL students' scores leans towards the left of the scale, while that of the UP students' scores shifts more to the right. This suggests that more of the CFL students' scores were rather low, while UP students scored somewhat higher. This difference may result from the fact that South African students use English as a second language, whereas Vietnamese students learn English as a foreign language, so their exposure to the target language is not as extensive as in South Africa. Additionally, familiarity with the test format (multiple choice) could also contribute to the difference between the scores of the two groups.

4.4.2 Appropriateness of TALL to assess academic literacy of CFL students

Regarding the research question, which entails the validation of TALL as a placement test of academic literacy for students of CFL (and of UP, if any), the coefficient of internal consistency (0.82) and the Greatest Lower Bound (0.91) clearly exceed the requirement for international high-stakes tests (>0.7). Yet, in terms of the contribution of the subtests to the test as a whole, the low coefficient alpha and low GLB of subtest 3, Verbal reasoning, indicate that if this subtest were removed, the reliability and the integrity among the subtests and the test as a whole would be higher (see Table 9).
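For reference, the coefficient of internal consistency used here is Cronbach's alpha, which for a test of $k$ items is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right),$$

where $\sigma_{i}^{2}$ is the variance of the scores on item $i$ and $\sigma_{X}^{2}$ is the variance of the total test scores. The Greatest Lower Bound is an alternative reliability estimate that is always at least as high as alpha, which is why the GLB (0.91) exceeds the alpha of 0.82 reported above.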

The average difficulty of the test (expressed in the P-value) matches the general ability of the students who took the pilot test (P-values = 50 and 53.73 for the CFL students and for the total sample from the two universities respectively).

As one of the purposes of the pilot test is to improve the reliability of the test for a particular group of students, i.e. CFL students, the decision as to which items should be discarded or modified is based on the facility value and the discrimination index (see Appendix). As set parameters of acceptability, the facility value should lie between 0.2 and 0.8. Accordingly, 22 items may have to be removed due to their low or negative facility values and low discrimination indices in the scores of both CFL and UP students: 7, 10, 14, 15, 16, 17, 18, 20, 22, 26, 56, 59, 60, 65, 66, 73, 74, 75, 77, 79, 99, and 100. A further 17 items may have to be left out due to their low or negative facility values and low discrimination indices in the scores of the CFL students alone: 3, 5, 19, 27, 28, 29, 30, 31, 32, 38, 42, 48, 64, 72, 89, 95, 96. Finally, the remaining 18 items may have to be modified or adjusted to improve students' performance and the appropriateness of the test for assessing the academic literacy levels of new CFL undergraduates (see Table 18).
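Applying these acceptability parameters can be automated. The sketch below flags items whose facility value (the proportion of correct answers, i.e. the P-value expressed as a proportion) falls outside 0.2-0.8 or whose upper-lower discrimination index is low. The 27% upper and lower groups and the 0.2 discrimination threshold are conventional choices assumed here for illustration; they are not values taken from the Iteman documentation.

```python
import numpy as np

def flag_items(scores: np.ndarray,
               fac_range: tuple = (0.2, 0.8),
               min_disc: float = 0.2,
               group_frac: float = 0.27) -> list:
    """Flag items whose facility value falls outside fac_range or whose
    upper-lower discrimination index falls below min_disc."""
    n_takers, n_items = scores.shape
    order = np.argsort(scores.sum(axis=1))      # rank test takers by total score
    g = max(1, int(round(group_frac * n_takers)))
    low_group, high_group = order[:g], order[-g:]
    flagged = []
    for i in range(n_items):
        facility = scores[:, i].mean()          # proportion answering correctly
        disc = scores[high_group, i].mean() - scores[low_group, i].mean()
        if not (fac_range[0] <= facility <= fac_range[1]) or disc < min_disc:
            flagged.append((i + 1, round(facility, 2), round(disc, 2)))
    return flagged
```

Run on the pilot data, such a routine would return the item numbers listed in Table 18 together with their facility and discrimination values, making the removal decision reproducible.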


However, such decisions should be made cautiously, since the number of pilot test takers was rather small (197 students). What is more, there is a caveat regarding the first solution: low overall facility values result from difficult items, which contribute to the difficulty of the test for the target groups (Van der Slik and Weideman, 2005, p.32). Thus, in refining the pilot test, the test designers may want to retain a certain number of difficult items to keep the test challenging, especially since the purpose of the test is to diagnose which students are at risk of low academic literacy, not to decide "pass" or "fail".

It is noteworthy that the subtests with more items also correlate well with the overall test score, as would be expected (since their length contributes more to their influence on the latter). One might therefore also refer to the statistical analyses generated by Tiaplus, which calculate the reliability each subtest would have if it were lengthened to 40 items.
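The standard way of projecting a subtest's reliability to a common length, presumably the one Tiaplus applies here, is the Spearman-Brown prophecy formula:

$$\rho_{kk} = \frac{k\,\rho_{11}}{1 + (k-1)\,\rho_{11}},$$

where $\rho_{11}$ is the observed reliability of the subtest and $k$ is the lengthening factor. For a five-item subtest projected to 40 items, $k = 40/5 = 8$, so an observed reliability of 0.5 would project to $\frac{8 \times 0.5}{1 + 7 \times 0.5} \approx 0.89$. Comparing these length-corrected values avoids penalizing the shorter subtests in the judgements above.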

4.5 RECOMMENDATIONS

The second research question, on the appropriateness of TALL for CFL students, highlights the importance of validity (the test assesses the complex notion of the academic literacy of first-year students), reliability or internal consistency (the reliability coefficient is 0.8, which is excellent for such a pilot test), and construct validity or theoretical defensibility (the multidimensional test engages a construct of seven different subtests that assess different skills and knowledge based on the adopted definition of academic literacy). In this light, the Verbal reasoning section is best left out, due to its low reliability coefficient and its very low correlations with the other subtests. At the same time, section 5, Register and text type, should be modified so that its relation with the total test becomes somewhat stronger.
