Vocabulary Size Tests for DSGS

As part of the Swiss National Science Foundation-funded project Scalable Multimodal Sign Language Technology for Sign Language Learning and Assessment, two vocabulary size tests targeting beginning adult learners of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) were developed (Haug, 2017). The first test is a Yes/No test (YN test), the second an L1/L2 translation test (i.e., from written German to DSGS). Both tests consist of approximately 100 items. Neither test format had previously been developed or used in a sign language testing scenario (Haug, Ebling, Boyes Braem, Tissi, & Sidler-Miserez, 2019).

 

The YN Test

Findings from YN tests for spoken languages informed the development of the YN test for DSGS. For example, the basic design of YN tests, in which a test taker sees a word and indicates whether he or she knows it (e.g., Beeckmans, Eyckmans, Janssens, Dufranne, & Van der Velde, 2001), was adopted for the current project. To minimize the effect of guessing, i.e., test takers marking more words with “yes” than they actually know (Read, 2000), Anderson and Freebody (1983) added pseudowords to a YN test. This approach was also adopted for the current project.

Another issue addressed in the literature on YN tests for spoken languages is how to interpret the test scores (Meara, 2005). In a YN test, real words a test taker marks as “known” are called hits, while real words marked as “unknown” are termed misses. Pseudowords marked as known are considered false alarms, and pseudowords marked as unknown are correct rejections (Stubbe, 2015). The simplest solution for interpreting the test scores is to add the hits and the correct rejections to obtain a total score. Currently, there are three different approaches for dealing with the false alarm rate (Schmitt, 2010; Stubbe, 2015):

  • Setting a maximum number of false alarm responses as a threshold for excluding a test taker: For example, Schmitt, Jiang, and Grabe (2011) suggest a 10% acceptance rate (3 out of 30 items), and Stubbe (2012) applied a 12.5% rate.
  • Adjusting the YN scores by using one of several different proposed formulas (see Stubbe, 2015). The simplest formula is to subtract the false alarm rate from the hit rate to arrive at a score that better reflects vocabulary knowledge (“true score”).
  • Applying a regression model to use the YN test scores to predict scores of translation tests (Stubbe & Stewart, 2012).

There appears to be no consensus on which approach is best or which formula within the second approach works best (Schmitt, 2010). For the purpose of this project, the second approach was applied (hit rate minus false alarm rate).
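
The correction used here (hit rate minus false alarm rate) can be illustrated with a minimal Python sketch. The function name and the example counts are hypothetical and do not come from the project data; the sketch only assumes that the hit rate is computed over the real signs and the false alarm rate over the nonsense signs.

```python
def yn_corrected_score(hits, n_real, false_alarms, n_pseudo):
    """Return hit rate minus false alarm rate for one test taker.

    hits          -- real items marked "yes"
    n_real        -- number of real items in the test
    false_alarms  -- pseudo items (nonsense signs) marked "yes"
    n_pseudo      -- number of pseudo items in the test
    """
    if n_real <= 0 or n_pseudo <= 0:
        raise ValueError("item counts must be positive")
    if not (0 <= hits <= n_real and 0 <= false_alarms <= n_pseudo):
        raise ValueError("response counts must lie within the item counts")
    hit_rate = hits / n_real
    false_alarm_rate = false_alarms / n_pseudo
    return hit_rate - false_alarm_rate


# Hypothetical test taker: 80 of 100 real signs and 5 of 25 nonsense signs
# marked as known, i.e. hit rate .80 and false alarm rate .20.
print(round(yn_corrected_score(80, 100, 5, 25), 2))
```

A test taker who marks every item as known would receive a corrected score of 0 under this formula, which is the intended penalty for indiscriminate “yes” responses.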

Beeckmans et al. (2001) report that there are no clear guidelines regarding the ratio of real words to pseudowords. For example, among the ratios reported in different studies are 30:3 (Schmitt & Zimmermann, 2002), 90:60 (Mochida & Harrington, 2006), 96:32 (Stubbe, 2015), 60:40 (Eyckmans, 2004), 72:28 (Harrington & Carey, 2009), and 40:20 (Meara, 1992). For the current project, 25 nonsense signs were developed, i.e., phonologically plausible forms of a sign language that bear no meaning in that language (Mann, Marshall, Mason, & Morgan, 2010). In total, the YN test consists of 125 signs, i.e., 100 real signs and 25 nonsense signs.

 

Video 1: Examples of the YN Test for DSGS (© Haug, 2017)

 

The L1/L2 Translation Test

Production tests have been used as a means of verifying the test taker’s self-reported vocabulary knowledge in a YN test (Stubbe, Stewart, & Pritchard, 2010). For the purpose of this study, a production test was developed and used. In the simplest form of a translation test for spoken languages, an L1 word is provided to the test taker, who produces the L2 translation (Laufer & Goldstein, 2004). Due to the absence of a widely accepted conventionalized writing system for sign languages (Boyes Braem, 2012), the L2 translation consists of a DSGS sign.

Stewart (2012) cautions that hand-scored production tests can lead to inconsistencies between raters. Two important issues in the development of scoring instruments are (1) to define a criterion of correctness and (2) to decide whether two (i.e., right/wrong) or more degrees of correctness (with partial credit) should be used (Bachman & Palmer, 1996). For assessing a single area of language knowledge, such as vocabulary, the right/wrong distinction can be useful, but when different areas of language knowledge are assessed, partial credit on more levels might be needed (Bachman & Palmer, 1996). This was applied for the current project.

 

Participants

Both vocabulary tests were administered to 20 adult test takers (5 male and 15 female). The test takers were between 24 and 55 years old (M = 39.3) at the time of testing. Nineteen of the 20 test takers were hearing; one had a cochlear implant but had acquired German as a first language and was learning DSGS as an adult. The majority of the test takers had one spoken language (e.g., a Swiss German dialect or Standard German; n = 18) as their L1. Two participants reported having grown up with two spoken languages. All participants had learned DSGS as adults (range: 18-53 years old, M = 35.4).

 

Results
Procedure: The YN test was delivered online using LimeSurvey (an open-source online survey tool), while the translation test was presented as a PowerPoint presentation showing the German target word and a sentence to disambiguate its meaning. The test takers’ sign productions were videotaped for later evaluation by two trained raters.
The overall goal of the project was to investigate the psychometric properties of the two vocabulary size tests. The item analysis of the YN test and the two ratings of the translation test resulted in an overlap of seven items that were candidates for removal, i.e., items that did not exhibit a facility value between .20 and .90 (Bachman, 2004) and/or a corrected item-total correlation of > .30 (Carr, 2011; Green, 2013; for more details see Haug, 2017). Internal consistency, investigated with Cronbach’s alpha, was > .96 for the YN test and for the translation test, for both raters separately. Inter-rater reliability was investigated with Cohen’s kappa (Gwet, 2014). The agreement between the two raters was kappa = .613, p < .001, which is considered “substantial” according to Landis and Koch’s (1977) kappa benchmark scale. Haug et al. (2019) argue that content validity could be established due to the sampling method of the items, which are based on beginning learners’ DSGS teaching materials (e.g., Boyes Braem, 2004). Because no other DSGS vocabulary test was available for investigating concurrent validity, external variables (number of DSGS courses attended, self-judgment of DSGS skills, context(s) of DSGS learning) were identified and their contribution to the test takers’ performance on both tests was investigated. The variable “number of courses attended” correlated significantly (rs between .528 and .601, p < .05) with the test scores on the YN test and the translation test, separately for both raters. The correlation between the YN test results and the translation test scores was statistically significant for both raters (r = .811 and .867, p < .001).
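
The statistics named above can be illustrated with a minimal Python sketch using only the standard library. It is not the project's analysis code. It assumes dichotomous (0/1) item scores, the conventional reading in which an item becomes a removal candidate when its facility value falls outside .20 to .90 or its corrected item-total correlation is below .30, and unweighted Cohen's kappa. All function names and the toy data are hypothetical.

```python
from statistics import variance


def _pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance
    (an assumption of this sketch, so that such items get flagged)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def facility_values(scores):
    """Proportion of test takers answering each item correctly.
    scores: one row per test taker, one 0/1 entry per item."""
    n = len(scores)
    return [sum(row[j] for row in scores) / n for j in range(len(scores[0]))]


def corrected_item_total(scores):
    """Correlation of each item with the total score excluding that item."""
    result = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]
        result.append(_pearson(item, rest))
    return result


def flag_items(scores, fv_range=(0.20, 0.90), min_r=0.30):
    """Indices of items outside the facility range or below min_r."""
    fvs, rs = facility_values(scores), corrected_item_total(scores)
    lo, hi = fv_range
    return [j for j, (fv, r) in enumerate(zip(fvs, rs))
            if not (lo <= fv <= hi) or r < min_r]


def cronbach_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters' categorical ratings."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    categories = set(rater1) | set(rater2)
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): 5 test takers x 4 items.
scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1]]
print(facility_values(scores))
print(flag_items(scores))
print(round(cronbach_alpha(scores), 3))
print(cohen_kappa([1, 1, 0, 0, 1, 0, 1, 1], [1, 0, 0, 0, 1, 0, 1, 1]))
```

In the toy data, the first item is answered correctly by everyone (facility value 1.0) and the last item is uncorrelated with the rest of the test, so both would be flagged under the assumed thresholds. If the translation test ratings use more than two levels, a weighted agreement coefficient may be more appropriate than the unweighted kappa shown here.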

Strengths and weaknesses
A strength of these two vocabulary size tests for DSGS is that they are the first vocabulary size tests for adult sign language learners to have been developed and evaluated. A weakness is that both tests were administered only to a small sample of 20 test takers. Additionally, the tests have been evaluated only within the framework of classical test theory. In the future, more modern approaches to test validation should be applied (e.g., Chapelle, 1999).


AUTHOR

Summarized by Tobias Haug (2019).

 

For more information regarding this test, please contact Tobias Haug at Sign Language Assessment Services, Zurich.