Vocabulary Size Tests for Swiss German Sign Language

Vocabulary Size Tests for DSGS

Within the Swiss National Science Foundation-funded project Scalable Multimodal Sign Language Technology for Sign Language Learning and Assessment, two vocabulary size tests targeting beginning adult learners of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) were developed (Haug, 2017). The first test is a Yes/No test (YN test), the second an L1/L2 translation test (i.e., from written German to DSGS). Both tests consist of approximately 100 items. Neither test format had previously been developed or used in a sign language testing scenario (Haug et al., 2019).

The YN Test

Findings from YN tests for spoken languages informed the development of the YN test for DSGS. For example, the basic design of YN tests, in which a test taker sees a word and indicates whether he or she knows it (e.g., Beeckmans et al., 2001), was adopted for the current project. To minimize the effect of guessing, i.e., a test taker marking more words with “yes” than he or she actually knows (Read, 2000), Anderson and Freebody (1983) added pseudowords to a YN test. This approach was also adopted for the current project.

Another issue addressed in the literature on YN tests for spoken languages is how to interpret the test scores (Meara, 2005). In a YN test, real words that a test taker marks as “known” are called hits, while real words marked as “unknown” are termed misses. Pseudowords marked as known are considered false alarms, and pseudowords marked as unknown are correct rejections (Stubbe, 2015). The simplest way to interpret the test scores is to combine the hits and the correct rejections into a total score. Currently, there are three different approaches for dealing with the false alarm rate (Schmitt, 2010; Stubbe, 2015):

1. Setting a maximum number of false alarm responses as a threshold for excluding a test taker: For example, Schmitt, Jiang, and Grabe (2011) suggest a 10% acceptance rate (3 out of 30 items), and Stubbe (2012) applied a 12.5% rate.
2. Adjusting the YN scores by using one of several different proposed formulas (see Stubbe, 2015). The simplest formula is to subtract the false alarm rate from the hit rate to arrive at a score that better reflects vocabulary knowledge (“true score”).
3. Applying a regression model to use the YN test scores to predict scores of translation tests (Stubbe & Stewart, 2012).

There does not seem to be a consensus as to what is the best approach or which formula in Approach 2 works best (Schmitt, 2010). For the purpose of this project, the second approach was applied (hit rate minus false alarm rate).
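
The first two approaches can be illustrated with a short Python sketch. It is not the scoring code used in the project; the class name, the function names, and the example counts are hypothetical and chosen only for illustration. The 10% default threshold follows the rate suggested by Schmitt, Jiang, and Grabe (2011).

```python
from dataclasses import dataclass


@dataclass
class YNResult:
    """Response counts for one test taker (hypothetical container)."""
    hits: int                # real signs marked "yes"
    misses: int              # real signs marked "no"
    false_alarms: int        # nonsense signs marked "yes"
    correct_rejections: int  # nonsense signs marked "no"


def hit_rate(r: YNResult) -> float:
    # Proportion of real signs the test taker claims to know
    return r.hits / (r.hits + r.misses)


def false_alarm_rate(r: YNResult) -> float:
    # Proportion of nonsense signs the test taker claims to know
    return r.false_alarms / (r.false_alarms + r.correct_rejections)


def adjusted_score(r: YNResult) -> float:
    """Approach 2 in its simplest form: hit rate minus false alarm rate."""
    return hit_rate(r) - false_alarm_rate(r)


def exceeds_false_alarm_threshold(r: YNResult, threshold: float = 0.10) -> bool:
    """Approach 1: flag a test taker whose false alarm rate is above the threshold."""
    return false_alarm_rate(r) > threshold


# Invented example: 100 real signs and 25 nonsense signs
example = YNResult(hits=80, misses=20, false_alarms=5, correct_rejections=20)
print(round(adjusted_score(example), 2))       # 0.80 - 0.20
print(exceeds_false_alarm_threshold(example))  # 0.20 compared with 0.10
```

With these invented counts, the hit rate is .80 and the false alarm rate is .20, so the adjusted score is .60, while the same test taker would be excluded under a 10% threshold.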

Beeckmans et al. (2001) report that there are no clear guidelines regarding the ratio of real words to pseudowords. For example, among the ratios reported in different studies are 30:3 (Schmitt & Zimmermann, 2002), 90:60 (Mochida & Harrington, 2006), 96:32 (Stubbe, 2015), 60:40 (Eyckmans, 2004), 72:28 (Harrington & Carey, 2009), and 40:20 (Meara, 1992). For the current project, 25 nonsense signs were developed, i.e., phonologically plausible forms of a sign language that bear no meaning in that language (Mann, Marshall, Mason, & Morgan, 2010). In total, the YN test consists of 125 signs, i.e., 100 real signs and 25 nonsense signs.

The L1L2 Translation Test

Production tests have been used as a means for verifying the test taker’s self-reported vocabulary in a YN test (Stubbe, Stewart, & Pritchard, 2010). For the purpose of this study, a production test was developed and used. The simple form of a translation test for spoken languages is that an L1 word is provided to the test taker and he or she produces the L2 translation (Laufer & Goldstein, 2004). Due to the absence of a widely accepted conventionalized writing system for sign languages (Boyes Braem, 2012), the L2 translation consists of a DSGS sign.

Stewart (2012) cautions that production tests that are hand-scored can result in an inconsistency between raters. An important issue concerning the development of scoring instruments is (1) to define a criterion of correctness and (2) to decide whether two (i.e., right/wrong) or more degrees of correctness (with partial credit) should be used (Bachman & Palmer, 1996). For assessing a single area of language knowledge, such as vocabulary, the right/wrong distinction can be useful, but when different areas of language knowledge are assessed, partial credit on more levels might be needed (Bachman & Palmer, 1996). This was applied for the current project.

Participants

Both vocabulary tests were administered to 20 adult test takers (5 male and 15 female). The test takers were between 24 and 55 (M = 39.3) years old at the time of testing. Nineteen of the 20 test takers were hearing; one had a cochlear implant but had acquired German as a first language and was also learning DSGS as an adult. The majority of the test takers had one spoken language (e.g., a Swiss German dialect, or Standard German; n = 18) as their L1. Two participants reported having grown up with two spoken languages. All participants had learned DSGS as adults (range: 18-53 years old, M = 35.4).

Results
Procedure: The YN test was delivered online using Limesurvey (an open-source online survey tool), while the translation test was presented in the form of a PowerPoint presentation, with the German target word and a sentence to disambiguate its meaning. The test takers’ sign productions were videotaped for later evaluation by two trained raters.
The overall goal of the project was to investigate the psychometric properties of the two vocabulary size tests. The item analysis of the YN test and the two ratings of the translation test resulted in an overlap of seven items that were candidates for removal, i.e., that exhibited a facility value outside the range of .20 to .90 (Bachman, 2004) and/or a corrected item-total correlation below .30 (Carr, 2011; Green, 2013; for more details see Haug, 2017). Internal consistency, investigated with Cronbach’s Alpha, was >.96 for the YN test and for the translation test (for both raters separately). Inter-rater reliability was investigated with Cohen’s Kappa (Gwet, 2014). The agreement between the two raters was Kappa = .613, p < .001, which is considered “substantial” according to Landis and Koch's (1977) Kappa benchmark scale. Haug et al. (2019) argue that content validity could be established due to the sampling method of the items, which are based on beginning learners’ DSGS teaching materials (e.g., Boyes Braem, 2004). Because no other DSGS vocabulary test was available for investigating concurrent validity, external variables (number of DSGS courses attended, self-judgment of DSGS skills, context(s) of DSGS learning) were identified and their contribution to the test takers’ performance on both tests was investigated. The variable “number of courses attended” correlated significantly (rs between .528 and .601, p < .05) with the test scores on the YN test and the translation test, separately for both raters. The correlation between the YN test results and the translation test scores was statistically significant for both raters (r = .811 and .867, p < .001).
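
The item statistics and the agreement coefficient mentioned above can be sketched in a few lines of Python. This is an illustration only, not the analysis carried out in the project. It assumes dichotomous (0/1) item scores and the usual reading that an item is flagged when its facility value falls outside .20 to .90 or its corrected item-total correlation falls below .30. All function names and the small data matrix are hypothetical.

```python
from collections import Counter
from statistics import mean


def facility_value(item_scores):
    """Proportion of test takers who got the item right (0/1 scores)."""
    return mean(item_scores)


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat an undefined correlation as 0
    return cov / (var_x * var_y) ** 0.5


def corrected_item_total(matrix, item):
    """Correlate one item with the total score of all other items."""
    item_scores = [row[item] for row in matrix]
    rest_totals = [sum(row) - row[item] for row in matrix]
    return pearson(item_scores, rest_totals)


def flag_items(matrix, fv_low=0.20, fv_high=0.90, min_citc=0.30):
    """Return indices of items that are candidates for removal."""
    flagged = []
    for j in range(len(matrix[0])):
        fv = facility_value([row[j] for row in matrix])
        citc = corrected_item_total(matrix, j)
        if not (fv_low <= fv <= fv_high) or citc < min_citc:
            flagged.append(j)
    return flagged


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Invented data: rows are test takers, columns are items
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
]
print(flag_items(scores))
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

In this toy matrix, the first item is flagged because every test taker answers it correctly (facility value of 1.0), and the last two are flagged because they do not correlate positively with the rest of the test.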

Strengths and weaknesses
Among the strengths of these two vocabulary size tests for DSGS is that they are the first vocabulary size tests for adult sign language learners to have been developed and evaluated. Among the weaknesses is that both tests were only conducted with a small sample of 20 test takers. Additionally, the tests have been evaluated only within a framework of classical test theory. In the future, more modern approaches to test validation should be applied (e.g., Chapelle, 1999).

Update

The YN Test has been revised within the Swiss National Science Foundation-funded project Scalable Multimodal Sign Language Technology for Sign Language Learning and Assessment - Phase II (SMILE-II). Currently, data from adult L2 learners of DSGS are being collected (including feedback on how the tests support the learners' own learning). The L1L2 Translation Test has also been revised and is also used in the SMILE-II project.

AUTHOR

Summarized by Tobias Haug (2019).

For more information regarding this test, please contact Tobias Haug at Sign Language Assessment Services, Zurich.

BIBLIOGRAPHY

Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assessment and acquisition of word knowledge. In B. Hudson (Ed.), Advances in reading/language research: A research annual (pp. 231–256). Greenwich, CT: JAI Press.

Bachman, L. F. (2004). Statistical analysis for language assessment. Cambridge: Cambridge University Press.

Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.

Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van der Velde, H. (2001). Examining the Yes/No vocabulary test: Some methodological issues in theory and practice. Language Testing, 18(3), 235–274.

Boyes Braem, P. (2004). Gebärdensprachkurs Deutschschweiz, Stufe 1. Linguistischer Kommentar [Sign language course DSGS, level 1. Linguistic comments]. Zürich: GS-Media/Schweizerischer Gehörlosenbund SGB.

Boyes Braem, P. (2012, October). Overview of research on the signed languages of the deaf. Lecture presented at the University of Basel.

Carr, N. T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.

Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272.

Eyckmans, J. (2004). Measuring receptive vocabulary size: Reliability and validity of the yes/no vocabulary test for French-speaking learners of Dutch. LOT, Utrecht.

Green, R. (2013). Statistical analyses for language testing. Retrieved from http://www.dawsonera.com/depp/reader/protected/external/AbstractView/S9781137018298

Gwet, K. L. (2014). Handbook of inter-rater reliability. Gaithersburg, MD: Advanced Analytics.

Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a placement tool. System, 37(4), 614–626. https://doi.org/10.1016/j.system.2009.09.006

Haug, T. (2017). Development and Evaluation of Two Vocabulary Tests for Adult Learners of Swiss German Sign Language. Master thesis. Lancaster: Lancaster University.

Haug, T., Ebling, S., Boyes Braem, P., Tissi, K., & Sidler-Miserez, S. (2019). Sign Language Learning and Assessment in German Switzerland: Exploring the potential of vocabulary size tests for Swiss German Sign Language. Language Education & Assessment, 2(1), 20–40. https://doi.org/10.29140/lea.v2n1.85

Landis, J. R., & Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.

Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning, 54(3), 399–436.

Mann, W., Marshall, C. R., Mason, K., & Morgan, G. (2010). The Acquisition of Sign Language: The impact of phonetic complexity on phonology. Language Learning and Development, 6(1), 60–86. https://doi.org/10.1080/15475440903245951

Meara, P. (1992). EFL vocabulary test. Swansea, UK: Centre for Applied Language Studies.

Meara, P. (2005). Designing vocabulary tests for English, Spanish and other languages. In C. S. Butler, M. de los Á. Gómez González, & S. M. Doval-Suárez (Eds.), The dynamic of language use (Vol. 140, pp. 271–285). https://doi.org/10.1075/pbns.140.19mea

Mochida, A., & Harrington, M. (2006). The Yes/No test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73–98. https://doi.org/10.1191/0265532206lt321oa

Read, J. A. S. (2000). Assessing vocabulary. Cambridge; New York: Cambridge University Press.

Schmitt, N. (2010). Researching vocabulary. Retrieved from http://link.springer.com/10.1057/9780230293977

Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in a text and reading comprehension. The Modern Language Journal, 95(1), 26–43. https://doi.org/10.1111/j.1540-4781.2011.01146.x

Schmitt, N., & Zimmermann, C. B. (2002). Derivative word forms: What do learners know? TESOL Quarterly, 36(2), 145–171.

Stewart, J. (2012). A multiple-choice test of active vocabulary knowledge. Vocabulary Learning and Instruction, 1(1), 53–59. https://doi.org/10.7820/vli.v01.1.stewart

Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation rates in yes/no vocabulary tests change with Japanese university students’ English ability levels? Language Testing, 29(4), 471–488. https://doi.org/10.1177/0265532211433033

Stubbe, R. (2015). Replacing Translation Tests With Yes/No Tests. Vocabulary Learning and Instruction, 4(2), 38–48.

Stubbe, R., & Stewart, J. (2012). Optimizing scoring formulas for yes/no vocabulary tests with linear models. Shiken Research Bulletin, 16(2), 2–7.

Stubbe, R., Stewart, J., & Pritchard, T. (2010). Examining the effects of pseudowords in yes/no vocabulary tests for low level learners. Language Education and Research Center Journal, 5, 1–16.