Development of the Video Analysis Scale of Engagement (VASE) for people with advanced dementia [version 1; peer review: 2 approved with reservations]

Background: The current study sought to develop a valid, reliable and unobtrusive tablet computer-based observational measure to assess engagement of people with advanced dementia. The Video Analysis Scale of Engagement (VASE) was designed to enable the rating of moment-by-moment changes in engagement during an activity, which would be useful for both future research and current residential care.
Methods: An initial version of the VASE was tested. Face validity and content validity were assessed to validate an operational definition of engagement and develop an acceptable protocol for the scale. Thirty-seven non-professional and professional volunteers were recruited to view and rate level of engagement in music activities using the VASE.
Results: An intra-class correlation coefficient (ICC) test showed a high level of rating agreement across professionals and non-professionals. However, the ICC results within the professional group were mixed. Linear mixed modelling suggested that the type of intervention (active or passive music listening), the particular intervention session being rated, the time period of the video and the age of raters could affect the ratings.
Conclusions: Results suggested that raters used the VASE in a dynamic fashion and that the measure was able to distinguish between interventions. Further investigation and adjustments are warranted for this to be considered a valid and reliable scale for measuring engagement of people with advanced dementia in a residential care setting.


Introduction
Dementia is a growing challenge for nations worldwide (Alzheimer's Society, 2015; World Health Organisation, 2018). Non-pharmacological interventions (NPIs) are thought to be a meaningful and preferred approach to care for people living with dementia (PWD) (Cabrera et al., 2015). These interventions include, for example, cognitive stimulation therapy (Orrell et al., 2017), singing (Camic et al., 2011), music and art therapies (Deshmukh et al., 2018; Raglio et al., 2015), talking therapies (Cheston & Ivanecka, 2017), and others. In earlier studies, the evaluation of whether these interventions were successful often relied on self-report measures or staff observations of behavioural and psychological symptom changes; factors such as social interaction and engagement had not been considered important factors or useful outcomes in dementia care (Du Toit et al., 2019; Sung & Chang, 2005). In recent years, there has been an emphasis on promoting wellbeing for PWD through social engagement within their immediate environments (Martyr et al., 2018). Such an approach recognises that focusing solely on emotional and behavioural outcomes to determine whether an intervention is successful risks interventions being "prescribed" to PWD on the basis of those outcomes, without considering the individual's personal choice.
Beard, Knauss & Moyer (2009) found that PWD continued to want an enriched life after being diagnosed. Having positive attitudes and engaging with physical, mental and social activities are thought to help maintain an enriched life (Beard et al., 2009). Kitwood's (1997a) person-centred approach highlighted the importance of maintaining personhood in dementia care, stressing the need to promote social inclusion to maintain the individual's identity. The maintenance of inclusion and identity can be achieved through self-directed support in dementia care (Mental Health Foundation, 2011), such as placing a strong emphasis on responding to needs "in the moment" (Dunne, 2002).
The choices and needs of people at advanced stages of dementia are not always clear, as deterioration in memory, difficulties in communication and impairments in daily activities increase as the condition progresses. Consequently, expressing basic needs and wants becomes more difficult and PWD perspectives become "lost" (Kumar & Kuriakose, 2013; Lai, 2015). Providing appropriate care that is meaningful for the PWD relies on staff and carers' understanding of and familiarity with the individual, which is enhanced by careful observation of daily interactions (Mental Health Foundation, 2011).

Engagement
One way to assist staff and carers to understand PWD's preferences is to consider a person with dementia's level of engagement with particular activities. Engagement, as a form of social interaction, is thought to be an important factor in determining the effectiveness of interventions in their ability to promote meaningful activity (Jones et al., 2015). One of the key aspects of person-centred care is the recognition that all human life is grounded in social relationships (Brooker & Latham, 2016).
Therefore, advanced dementia care should focus on creating a rich social environment to foster personal growth by maintaining engagement, relationships and activities appropriate to the level of impairment (Mitchell & Agnelli, 2015). Other than human interaction, Cohen-Mansfield et al. (2009) suggested that engagement could refer to "the act of being occupied or involved with an external stimulus" (p. 2); this idea suggests that being engaged not only means connecting with people, but also with other stimuli, such as objects, music and activities.
Engagement within a social context is often determined by an individual's participation in the activities of a social group (Prohaska et al., 2012). Zhang, Jiang & Carroll (2011) suggested that engagement means that a member stays in the group and interacts with others. Perugia et al. (2018) defined engagement for those with dementia as wellbeing, enjoyment and active involvement triggered by meaningful activities. For the purpose of this study, engagement is conceptualised as a state of wellbeing and involvement triggered by participating in activities within a group.
During the advanced stages of dementia, when language skills often deteriorate, residential care is sometimes a consideration (Herrmann & Gauthier, 2008). The environment of residential care can create difficulties in day-to-day social life, challenge a sense of identity and reduce self-confidence and, as a result, engagement in daily activities diminishes over time (Clair et al., 2005; Hubbard et al., 2002). Reduced engagement can lead to boredom, loneliness and depression (Cohen-Mansfield et al., 2009); it is therefore essential for people with dementia to participate in activities that promote positive social interactions in residential care. Research further suggests that engaging in meaningful and enjoyable activities can lead to a better quality of life (Smit et al., 2016), fewer behaviour problems (Braun, 2019) and increased positive emotions (Kolanowski et al., 2014).
It is important to note that a lower level of engagement in particular activities should not be seen as a symptom or a lack of ability, but rather, may be an indicator of having a strong sense of self (Sabat, 2006). Individuals rejecting participation in an activity might be indicating an ability to advocate and express needs. Sabat (2006) proposed that "self" remains even in the advanced stages of dementia and there are three forms of self, each with different attributions. The most vulnerable self for people in the advanced stage of dementia is a "publicly presented persona that requires the cooperation of others". To protect this part of the self, Sabat (2006) stressed that carers should provide good quality interactions that support relationships and the role of the individual. Sabat's (2006) theory is a further development of Kitwood's concept of personhood focusing on person-centred care, which identifies activities to stimulate engagement at advanced stages of dementia. As such, gauging engagement with this population becomes an important care issue.
Methods to measure engagement
A review conducted by Curyto et al. (2008) concluded that observational measures are the "gold standard" of assessment for older people at the advanced stages of dementia (Kawulich, 2012). Observation techniques have previously been used to investigate the process and interactions in dementia care (Curyto et al., 2008; Gaugler et al., 2013). Observations are thought to be able to gather meaningful data about day-to-day functioning that might be missed by standard questionnaires (Algar et al., 2016).
Limitations of traditional observational measures include requiring observers to be present during a given session, which is costly and labour-intensive, as well as potentially creating stress that could affect group interactions (Carthey, 2003). Less intrusive and cost-effective alternatives could address these concerns. Robert et al. (2010) noted that the existing observational engagement measures are difficult to use, even for professionals. Their review considered 68 available assessment scales in Alzheimer's disease for various purposes and concluded that there was a need for an easy-to-administer scale for identifying response to therapy in daily practice (Robert et al., 2010). The existing observational protocols are time-intensive to administer and frequently unable to monitor direct carer-PWD interactions that reflect person-centred care (Gaugler et al., 2013). For example, the Observational Measurement of Engagement (OME) has been validated to examine the engagement of PWD in interventions such as music (Cohen-Mansfield et al., 2009). It follows a complex protocol requiring formal training and a substantial amount of time to learn; this complexity potentially reduces the accessibility of such measures for clinicians and care staff. Consequently, staff carers might struggle to find appropriate measures to assess whether particular activities are engaging to people at the advanced stages of dementia. Furthermore, the available engagement measures produce either a single score or an average score over a fixed period (e.g. 30 seconds to 1 minute). However, human interactions are "dynamic", changing "moment-by-moment", which this method of measuring might miss. Therefore, there is a need to create a flexible measure that allows those doing the rating (henceforth "raters") to capture these dynamic changes as and when they observe them.

Video-based observation
Video-based observation offers a good alternative as it is an unobtrusive method, minimising disruption to the social setting through the presence of researchers and observers. This method can also capture multiple, complex interactions simultaneously while gathering a larger amount of data than traditional observational methods (Asan & Montague, 2014). A further benefit of using video analysis with a severely impaired dementia population is that it allows both researchers and care staff to closely view group interactions, while facilitating an examination and understanding of subtle behaviours occurring within the group (Asan & Montague, 2014). Video analysis can also enable the use of raters from the wider social system (e.g. lay people, family carers and non-dementia experts), therefore permitting a more comprehensive perspective of care. A wider perspective of care, including a family's views and support, is thought to be important in dementia care (Moore et al., 2012).

Music as an intervention
The rationale for using music activities as a basis for developing an observational measure was that music is seen as a useful intervention across all stages of dementia (Alzheimer's Society, 2018; Abraha et al., 2017). Both live and recorded music interventions have been widely used in community and residential care settings (Clare & Camic, 2019). These interventions have been found to increase quality of life (Vasionytė & Madison, 2013), as well as promoting other physio-psychosocial outcomes in dementia (Cooke et al., 2010). Most importantly, such interventions have been found to improve levels of engagement (Eggert et al., 2015) and are therefore a suitable non-pharmacological intervention for the purpose of this study. Music activities are thought to engage by promoting relaxation or providing sensory stimulation, which are thought to increase alertness, reduce agitation and improve quality of life for those across different stages of dementia (Cohen-Mansfield et al., 2009; Strøm et al., 2016).
Music interventions are of particular importance for people at the advanced stage of dementia because, although language ability deteriorates, musical recognition abilities are relatively preserved (Baird & Samson, 2015). This lends support to the theory of individualised music intervention for agitation (IMIA), which suggests that music acts as a medium of communication for those who have an impaired ability to understand verbal language (Clare et al., 2019; Gerdner, 2012). Currently, one widely used music programme in the United Kingdom (UK) for advanced dementia is Music for Life (MfL) (Music for Life, 2014). MfL is an approach that brings together professional musicians, care staff and people living with dementia through interactive music to enhance their quality of life (Rose, 1993).

Aim and objectives
An effective and easy-to-administer measure that assesses engagement in advanced dementia has yet to be developed. As part of the Created Out of Mind project (Brotherhood et al., 2017) the present study aimed to develop a non-intrusive, easily used video-based observational tool to assess level of engagement during a music intervention (MfL). The development of the measure, the Video Analysis Scale of Engagement (VASE), was carried out in two stages: the protocol development stage and the validation stage. This paper presents the development of the VASE and reports its psychometric properties in terms of validity and reliability analyses. The research question, objectives and hypotheses of the study are presented in Box 1.

Box 1. Project objectives and hypotheses

Primary research question
Can an observational rating tool effectively measure the engagement of people with advanced dementia in a music activity?

Overall objectives
To develop an observational rating tool with a user-friendly operational protocol to measure the level of engagement of people with advanced dementia. The intervention used for observing the participants' level of engagement is a music-based activity (Music for Life, MfL) in residential care.

Objective 1.
Identify an operational definition of engagement for the Video Analysis Scale of Engagement (VASE) and determine its face and content validity.
Objective 2. Determine inter-rater reliability and intra-rater reliability of the VASE by comparing groups of raters (non-professionals and dementia care professionals). If the scale is found to be reliable, assess the validity of the scale to differentiate engagement by comparing two conditions, namely the MfL and passive listening (PL) groups.
Objective 3. Assess whether other variables, including age, gender, ability to play a musical instrument, participation in a choir in the past or the present, types of session (MfL versus PL) or the order of videos raters rated (order effect), might affect the engagement score.

Hypotheses
Hypothesis 1. Related to objective 2 above, there will be a significant correlation between the VASE rating by dementia experts and non-professionals.
Hypothesis 2. Related to objective 2, there will be a correlation between the VASE rating among the dementia experts as well as that among the non-professionals.
Hypothesis 3. Related to objective 2, the rating tool will be sufficiently sensitive to differentiate the engagement level between the two music-based activities (MfL versus PL).
Hypothesis 4. Related to objective 3, extraneous variables will not affect the rating.

Design and procedure
This study adopted a mixed-methods design (Creswell et al., 2003) and was separated into development and experimental stages. In the development stage a qualitative method was used to identify an operational definition of engagement, followed by developing a protocol for the VASE. Face validity and content validity were then assessed. Face validity is the appropriateness, sensibility, or relevance of the tool and its items as they appear to the persons answering the test (Holden, 2010), whereas content validity investigates whether the scale adequately covers the content that it should with respect to the variable (i.e. engagement) (Heale & Twycross, 2015). The experimental stage followed a quasi-experimental design to investigate the detection and rating of different levels of engagement in an active group (MfL) and a passive listening (PL) group. The reliability of the VASE was then assessed. Criterion validity was considered, to compare how well the VASE measured engagement against another validated measure (Heale & Twycross, 2015). This validity test was not carried out for two reasons. Firstly, there were no tools available that measured moment-by-moment level of engagement. Secondly, if the VASE were compared with other non-moment-by-moment engagement tools, such as the OME, participants would have had to pause the VASE rating every 15 seconds to complete another rating, making the experiment much longer and more complex, and making it more difficult for participants to focus on watching and rating the video itself.

Video analysis scale of engagement (VASE).
The VASE is an offline application written in hypertext markup language (HTML) (Lai et al., 2020b) and developed as a tablet computer-based program consisting of three main parts: 1) a 500 mm × 400 mm viewing box that allows raters to view a preloaded video; 2) a seven-point Likert-type scale (from 0, low engagement, to 7, high engagement) to record changes in engagement while viewing the video; and 3) an exact time stamp to determine the time of each rating relative to the start of the video. The VASE adopts a continuous scoring system using this Likert-type scale to assess level of engagement whilst viewing a video segment. Responses and the times of the responses are automatically recorded by the software. The VASE enables raters to respond in real time by simply tapping a scale on the screen without the need to stop the footage to record ratings.
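The continuous-scoring behaviour described above (each tap stores a rating together with its offset from the start of the video, without pausing playback) can be sketched in Python. This is an illustrative model only, not the actual HTML implementation; the class and method names are hypothetical.

```python
import time

class VaseSession:
    """Minimal sketch of VASE-style continuous rating capture.

    Each tap on the engagement scale stores the chosen rating together
    with its offset (in seconds) from the video start, so rating never
    interrupts playback.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable clock, so the sketch is testable
        self._start = None
        self.events = []      # list of (offset_in_seconds, rating)

    def start_video(self):
        """Record the moment playback begins."""
        self._start = self._clock()

    def tap(self, rating):
        """Record one tap on the 0-7 engagement scale."""
        if not 0 <= rating <= 7:
            raise ValueError("rating must be on the 0-7 scale")
        offset = self._clock() - self._start
        # Time stamps are rounded to the nearest second for analysis.
        self.events.append((round(offset), rating))
```

Injecting the clock keeps the sketch testable without real playback; the actual app records the equivalent (time stamp, rating) pairs automatically.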

Development stage
Face validity. A preliminary version of the VASE (Lai et al., 2020b) was field-tested with six healthy adult volunteers from the general public. An opportunity sample was used, where volunteers were approached by the researchers and invited to participate. Volunteers were recruited from the general public in Hong Kong, the Hub at the Wellcome Collection, the Dementia Research Centre, University College London, and the Salomons Institute, Canterbury Christ Church University. The volunteers were asked to watch one of two YouTube videos preloaded onto the VASE app. Both video clips were around three minutes in duration and consisted of a musician delivering a music intervention to a person with advanced dementia. Feedback was sought using open-ended questions on the usability of the app, whether explanations about which behavioural expressions were regarded as engagement were adequate, and brief descriptions of volunteers' decision-making process during rating. Verbal feedback was recorded in a questionnaire and incorporated into the earlier version of the measure to further refine and revise the VASE. Using Braun & Clarke's (2006) six-stage approach to thematic analysis, common patterns and categories about what engagement of people with dementia looks like were identified (Table 1).

Content validity.
Following open-ended feedback from the six volunteers, thematic analysis results were reviewed by four interdisciplinary experts (musician practicing music within residential care, neuropsychologist, clinical health psychologist and a trainee clinical psychologist), in order to revise and create a protocol that would more accurately reflect the rating of engagement in the VASE. After the protocol was completed and adjustments made the final version (Lai et al., 2020b) was further tested by two volunteers, from the Dementia Research Centre, University College London, who were verbally approached to participate.

Experimental stage
The experimental stage was conducted in two parts. The first part involved recording active and passive music sessions (MfL and PL, respectively) involving people with advanced dementia living in a residential care home. The second part involved recruiting raters to rate brief video excerpts from the music sessions using the VASE. Each part of the study, the setting, participants and procedures, are described below.

Part 1
Setting. Video recordings of MfL and PL sessions were made in a London care home.
Participants. Eight PWD participated. Recruitment criteria were: (i) a confirmed diagnosis of dementia; (ii) a Clinical Dementia Rating Scale (Morris et al., 1997) score of 2-3 (advanced) as rated by care home staff; (iii) aged 60 or above; and (iv) able to sit in a room for an hour in a group setting. PWD who had (i) a clinical dementia rating of below 2; (ii) significant hearing difficulties that could not be corrected, even with a hearing aid; or (iii) disruptive behaviour during group activities in the care facility (e.g. aggressive behaviour) were excluded. These criteria were screened by care staff at the care home and verified by one of the researchers.

The intervention (MfL) and passive listening (PL) conditions.
The MfL programme is one type of music-based intervention used for advanced dementia. It is an interactive music programme designed to promote better quality of life for people with dementia in residential care (Music for Life, 2014); it took place for one hour a week over the course of eight weeks. Each week, specially trained musicians facilitated the session and attempted to establish and enhance communication with the PWD through improvisational music and activities. The PL session was held once, whereby participants listened to pre-recorded music that was similar to that used in the intervention sessions and played by the same musicians. The setting, day and time of day, length of session, and number of musicians and care staff present were equivalent in the intervention and control conditions.
Procedures. The experimental stage first involved a one-hour control session. This was followed one week later, at the same time and in the same location, by the start of the eight-week MfL intervention. At the beginning of each session, a 360-degree Fly® video camera was placed in the middle of the room; this camera, which is smaller than a tennis ball, captures continuous 360-degree recording, making it ideal for use in groups. Videos were edited by an independent video editor into 30-second segments, and 12 from the control session and 4-5 from each of the intervention sessions were chosen pseudo-randomly. In each segment one of the eight participants was randomly chosen for the raters to specifically focus on, indicated with a yellow arrow (Figure 1). The 48 segments from the control (PL) and intervention (MfL) sessions were then edited into a single 25-minute video.
The order of the clips in the video was also pseudo-randomised. Two more videos were made in the same manner as the first one, where the order of the clips was again pseudo-randomly assigned to remove potential order effects.
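The pseudo-randomisation step described above (the same pool of 30-second segments re-ordered into three rating videos to control for order effects) can be sketched as follows; the function name, seed and default values are hypothetical.

```python
import random

def make_clip_orders(n_segments=48, n_videos=3, seed=2019):
    """Sketch of pseudo-randomised clip ordering: produce several
    orderings of the same pool of 30-second segments so that any
    order effect can be checked across the rating videos.
    """
    rng = random.Random(seed)   # fixed seed keeps the orders reproducible
    segments = list(range(n_segments))
    orders = []
    for _ in range(n_videos):
        order = segments[:]     # copy, so every video draws from the same pool
        rng.shuffle(order)
        orders.append(order)
    return orders
```

Fixing the seed makes the clip orders reproducible, so every rater assigned to, say, video M2 sees the segments in an identical order.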

Part 2
Setting. The video clips were viewed on the tablet computers by the raters and scored in secure, non-public places. These included meeting rooms at universities and research organisations, a private meeting room in a public library, and a charity's office.
Participants. Opportunity sampling was used to recruit professional and non-professional raters in Hong Kong and the UK through emails and face-to-face contact. Six professionals were included as raters (clinical psychologists, neuropsychologists, nurses and dementia charity managers). The inclusion criteria were as follows: (i) working in a health-related discipline; (ii) aged 18 or over; and (iii) one or more years of experience working in dementia care. Thirty-one people from the general public (non-professionals) aged 18 and above, who had not worked clinically with dementia, were also recruited.

Sample size.
In accordance with the Medical Research Council (MRC) framework (2000), this was a feasibility study. Lancaster, Dodd & Williamson (2004) recommend an overall sample size of 30 for feasibility studies. Table 2 presents the demographic characteristics of the raters.

Procedures.
Once consent forms were signed, the raters were requested to provide demographic information and were given the protocol describing the categories of engagement to read. Raters were then given a password-protected tablet and over-ear headphones to complete a series of rating scales while watching the video recording of the MfL group. Research personnel were present to supervise this process and answer questions during all video viewings. Inter-rater reliability was tested to establish the consistency of the final version of the VASE.

Data analysis
Raw data from the VASE, including the times and ratings, were recorded. The time stamps were then rounded to the nearest second. Data were then transferred into SPSS version 23 for analysis.
Reliability. An intra-class correlation coefficient (ICC) was used to determine the consistency of ratings across all raters. This was also used to assess the engagement scores across the three videos (different clip orders: M1, M2 and M3), the conditions (PL and MfL sessions) and the rater groups (professionals and non-professionals). Spearman's correlation was used to evaluate the inter-correlations between the professionals, as well as between the non-professionals.
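For illustration, a two-way random-effects, single-measure ICC of the kind reported here can be computed directly from its mean squares. This is a generic ICC(2,1) sketch in Python, not the exact SPSS procedure used in the study.

```python
def icc_2_1(ratings):
    """Two-way random-effects, single-rater agreement ICC(2,1).

    `ratings` is a list of rows, one per rated target (video clip),
    each row holding one score per rater.
    """
    n = len(ratings)      # number of targets
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Mean squares for targets (rows), raters (columns) and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ms_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    ) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Perfect agreement across raters yields an ICC of 1; a systematic offset between raters (e.g. one rater consistently scoring higher) lowers the agreement ICC even when the rankings match.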

Mixed model analysis.
Multilevel linear modelling (MLM) was used to investigate the effects of raters' characteristics on the distribution of ratings and the average rating over time. The analysis included the ratings provided by raters overall, video condition (MfL versus PL), gender, video watched (different clip orders: M1, M2 and M3), professional or non-professional status, with or without experience in looking after PWD, and experience of playing a musical instrument or singing in a group. Since one participant would appear in multiple clips and was rated by each rater at these different time points, MLM takes account of the dependencies by estimating variance associated with group (e.g. raters) and differences in average response (intercepts). The model adopted for data analysis in this study treats intercepts and/or slopes as random effects. All of the analyses were conducted using SPSS version 23 and the level of significance was set to 0.05.
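A mixed model of this kind operates on long-format data: one row per rater-by-clip observation, with the rater as the grouping (random-effect) factor and characteristics such as condition as fixed effects. A minimal sketch of that reshaping step, with hypothetical field names:

```python
def to_long_format(scores, conditions):
    """Reshape a {rater_id: [one rating per clip]} dict into long-format
    rows suitable for a mixed model, where `rater` becomes the
    random-effects grouping factor and `condition` (e.g. 'MfL'/'PL')
    a fixed effect.
    """
    rows = []
    for rater_id, ratings in sorted(scores.items()):
        for clip_idx, rating in enumerate(ratings):
            rows.append({
                "rater": rater_id,          # grouping factor
                "clip": clip_idx,           # repeated-measures index
                "condition": conditions[clip_idx],
                "rating": rating,           # outcome variable
            })
    return rows
```

The resulting rows could then be fitted with a mixed-effects routine (e.g. statsmodels' `mixedlm` in Python, or the SPSS MIXED procedure used here), grouping on the rater column.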

Ethical considerations
Ethics approval was granted by Canterbury Christ Church University, Salomons Institute Ethics Panel (approval number: V:\075\Ethics\2016-17), and the study was also approved by the review panel of the charity where the research was conducted (approval number: V1a28617). The study followed National Institute for Health Research guidelines for working with people who are unable to directly provide informed consent. An information sheet was given to caregivers and consent was obtained from each participant's primary caregiver, who had the legal authority to give consent. Before each video recording, centre staff and musicians would also verbally remind the participants that the sessions were being recorded for the purpose of this research and offer them the opportunity to withdraw from the recording; none withdrew. Raters were reminded they would be watching a recording of a vulnerable population and that they should not disclose the name or identifying information of people viewed in the video; all signed consent forms and agreed to abide by these requests. Video recordings were transferred from the camera directly to encrypted files in a password-protected tablet computer and the video data removed from the camera.

Results
Development stage
Face validity. Based on the feedback from the volunteers watching the YouTube videos, some adjustments were made; these included adjusting the size of the videos to 850 mm × 500 mm and adjusting the font size. The volunteers reported finding it difficult to distinguish the different types of engagement. Some volunteers also expressed that watching the same video three times made them lose interest and, as a result, they found it difficult to concentrate during the repeat viewings. Consequently, it was decided that the VASE should be used to rate an overall state of engagement. After the revisions to the app were made, a further field-test was carried out with five more volunteers to ensure that the final version of the VASE was suitable (Figure 1).
In addition, based on the questions related to aspects that made raters consider a PWD to be engaged in the group, a thematic analysis was carried out using the interview data gathered from volunteers. The interviews were transcribed and four main categories (patterns) and 12 behavioural expressions of the categories were finalised (Table 3).

Content validity.
Four main categories were derived for the initial protocol based on the qualitative data analysis: a) facial expressions; b) bodily movement and verbal articulations; c) attention and awareness of activity; and d) emotional responses. The initial protocol was much briefer, offering little description of the categories. There was 100% agreement from the experts, indicating that the categories and their corresponding behavioural expressions are a good representation of engagement in people with advanced dementia. However, the experts also highlighted the importance of offering some examples of behaviours relating to each category. They specifically considered that explanations should include behavioural expressions that are not easily picked up as signs of engagement. For instance, one of the experts spoke about PWD experiencing what other people might describe as "negative" emotions, such as sadness, where they might in fact be emotionally "moved" by the music. The experts further commented that there is a need to highlight that sometimes closed eyes, or even a neutral look, can be a sign of a person being engaged. In addition, a professional's rating would also be dependent upon different cues, such as the context of the situation, the rater's own experience, and the rater's understanding of the group. Consequently, some experts proposed that the protocol should not be a rigid manual; rather, it should simply provide a reference for what engagement is and allow a certain amount of ambiguity and openness towards a rater's own interpretation. The appropriateness of the rating would be determined in the third stage of inter-rater reliability testing: to see if there was consistency and agreement, different raters' scoring of the same individual at the same time would be statistically tested to examine if variances existed. Accordingly, a statement about there being no right or wrong answer was added to the brief description. This process resulted in the final version of the VASE protocol.
These higher ratings were observed irrespective of the order in which the videos were viewed and rated (M1, M2 or M3) or their professional/nonprofessional status (Table 4). It should be noted that the mean MfL ratings shown in Table 4 and Table 5 are mean scores of MfL sessions 1 to 8 (Figure 2). Differences between MfL and PL mean ratings suggest that the inter-class correlation should be analysed separately for MfL and PL sessions.

Reliability
Inter-rater reliability. In our study, the ICC was used to measure the consistency of raters in the rating of the engagement of the target participants over 48 video clips. According to Koo and Li's criteria (2016), values less than 0.5 are considered to have poor reliability, 0.5 to 0.75 are considered to have moderate reliability, 0.75 to 0.9 is considered to have good reliability and ICCs greater than 0.9 are considered to have excellent reliability. The findings accept hypothesis 2 and suggest that there is a orrelation between the VASE rating among the dementia experts, as well as among non-professionals.
The mean rating scores of the raters across videos (M1, M2 and M3) were assessed ( Spearman's correlation coefficients between the professionals were evaluated. The results show that the coefficients ranged from 0.353 to 0.72 (Table 7). In our study, there was a high correlation between experts E and F, which indicated that they had a high level of agreement in their ratings. The correlation coefficients among experts B, E and F were greater than 0.5, which indicates that they had a moderate correlation. The correlation between experts D, E and F, as well as that between experts A, B and C was also moderate. On the other hand, experts A and C had a weak correlation with D, E, and F. All the correlation coefficients indicated a statistically significant degree of correlation. Correlation coefficients between the non-professionals were should not be a rigid manual; rather, it should simply provide a reference for what engagement is and allow a certain amount of ambiguity and openness towards a rater's own interpretation. The appropriateness of the rating would be determined in the third stage of inter-rater reliability testing. To see if there was consistency and agreement, different raters' scoring of the same individual at the same time were statistically tested to examine if variances existed. Consequently, a statement about there being no right or wrong answer was added to the brief description. This process resulted in the final version of the VASE protocol.

Experimental stage
Observing raters during the viewing sessions, and from informal comments made by the raters, it was apparent that they had a variable delay in making their rating as each new video appeared and they appraised the scenario. As a result, the first five seconds of each clip were excluded from the analysis of the rating data gathered. In total, each rater watched 1,200 seconds of clips, of which 300 seconds were PL and 900 seconds were MfL sessions. The VASE was tested across 37 raters. The mean rating scores of each individual rater under MfL and PL conditions are reported in Table 5. A non-parametric Mann-Whitney U test was used to analyse whether there were any differences in rating between the two conditions by each rater. The results showed that most of the raters (36 out of 37 raters) gave significantly     also calculated, with results showing that 89.5% (416 out of 465) of the correlation coefficients between raters are significantly correlated at 5% level (Table 8). In general, the findings support hypothesis 2, indicating that there is a correlation between the VASE rating among professionals, as well as non-professionals.
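As an illustration of the reliability criteria used here, the mapping onto Koo and Li's (2016) bands, together with a basic two-way ICC, can be sketched in a few lines of Python (a minimal illustration only; `icc_2_1` and `koo_li_band` are our own naming, not the study's analysis code):

```python
def icc_2_1(ratings):
    """Two-way random effects, absolute agreement, single-rater ICC(2,1).

    `ratings` is a list of rows: one row per subject (video clip),
    one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

def koo_li_band(icc):
    """Map an ICC value onto Koo and Li's (2016) reliability bands."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"

# Three clips rated identically by two raters -> perfect agreement.
print(icc_2_1([[1, 1], [2, 2], [3, 3]]))  # 1.0
print(koo_li_band(0.82))                  # good
```

A constant offset between raters (one rater always scoring one point higher) lowers this absolute-agreement form of the ICC, which is why consistent rank ordering alone does not guarantee a high ICC.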
Linear mixed modelling (MLM) was used to investigate the relationship between the rated level of engagement and a number of variables, including condition type (active [MfL]/passive [PL]), session number (1-8), within-video period (1-5), and rater characteristics such as age, gender, professional status (professional/non-professional), experience of playing an instrument, experience of singing in a choir, presence at one of the MfL sessions, and experience of PWD (these last four recorded as dichotomous yes/no variables).
In our hypothesised model, the MfL/PL condition, session, the five within-video periods, and rater profile variables (age, gender, profession, playing an instrument, experience of singing in a choir, and experience with dementia) were entered as fixed effects (without interaction terms), while the rater and the PWD recorded in the videos were entered as random effects. This reflects the hypothesis that levels of rated engagement would vary systematically across raters, as well as across the PWD being rated.
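This fixed- and random-effects structure can be written out as follows (the notation is ours, not from the original analysis):

engagement_ijt = β₀ + β₁·condition + β₂·session + β₃·period + γᵀ·(rater covariates)_i + u_i + v_j + ε_ijt,

where i indexes raters, j indexes the PWD shown in the videos, u_i ~ N(0, σ²_rater) and v_j ~ N(0, σ²_PWD) are random intercepts, and ε_ijt is residual error. The two random intercepts allow each rater, and each PWD being rated, their own baseline level of rated engagement.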
A full model (-2 log likelihood = 167030.821) that included all of the variables was significantly better than one in which only the intercepts were included (-2 log likelihood = 173760.602), χ²(10, N = 44,400) = 6729.781, p < 0.001. Thus, the inclusion of all variables improved the model beyond that produced by considering variability in raters and participants alone; this significantly lower chi-square value for the full model justified the use of MLM. Among the 10 predictors entered into the model, five were significantly associated with the level of engagement (Table 9): period (β = 0.12, p < 0.005), MfL/PL condition (β = -1.43, p < 0.005), session (β = 0.02, p < 0.005), age (β = -0.03, p < 0.005) and profession (β = 0.97, p < 0.05):
1. Period (β = 0.12, p < 0.005): Ratings differed significantly across the five within-video periods, increasing on average by 0.12 points per period, indicating that raters' ratings were changing over time.
2. MfL/PL condition (β = -1.43, p < 0.005): Engagement was rated significantly higher for the MfL sessions than the PL session, by an average of 1.43 points on the rating scale.
3. Session (β = 0.02, p < 0.005): Ratings were higher in later sessions than in earlier ones, with an average difference of 0.02 points per session, which might not be clinically significant.
4. Age (β = -0.03, p < 0.005): Older raters gave significantly lower ratings.
5. Profession (β = 0.97, p < 0.05): Professional raters provided significantly higher ratings.
Table 10 shows the random effects of the model. There was significant variability in the ratings given by different raters (p < 0.001), as well as significant variability between the PWD being rated (p < 0.001). There was also significant residual variance after taking all effects in the model into account, which might indicate that the model requires additional variables. The residuals of the model were examined with a Q-Q plot; they followed a normal distribution, and it was therefore concluded that the normality assumption of the model was supported.
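The model comparison above is a standard likelihood-ratio test: the χ² statistic is simply the difference between the two -2 log likelihood values. A minimal sketch reproducing the reported statistic (helper names are ours, not the study's analysis code; the closed-form survival function only covers even degrees of freedom):

```python
import math

def lr_statistic(m2ll_intercept_only, m2ll_full):
    """Likelihood-ratio test statistic: difference of the two -2 log likelihoods."""
    return m2ll_intercept_only - m2ll_full

def chi2_sf_even_df(x, df):
    """Chi-square survival function P(X > x), closed form for even df:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!."""
    if df % 2 != 0 or df <= 0:
        raise ValueError("closed form requires positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Values reported in the study: -2LL = 173760.602 (intercepts only) vs 167030.821 (full).
chi2 = lr_statistic(173760.602, 167030.821)
print(round(chi2, 3), chi2_sf_even_df(chi2, df=10) < 0.001)  # 6729.781 True
```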

Discussion
The aim of this feasibility and validation study was to develop an easily accessible and user-friendly measure for assessing the engagement of people with advanced dementia. The study helped to better understand engagement in people with advanced dementia by creating an operational definition of the concept. It also showed the possibility of adopting the VASE as a measure for this purpose, particularly in group settings, and demonstrated the potential utility of continuous, multi-rating scales that capture dynamic, moment-to-moment changes in a rateable concept such as engagement. In the study, face validity was established based on the opinions of volunteers from the general public, and thematic analysis of interview data was used to construct an operational definition of engagement. Inter-rater reliability was documented, and strong agreement was found in some conditions.
Hypothesis 1, which predicted a correlation between VASE ratings made by dementia professionals and by non-professionals, was supported. The ICC indicated that the VASE has good to excellent agreement between the two samples of professionals and members of the general public.
Yet the MLM suggested that professionals generally provided significantly higher rating values than the general public (on average 0.97 Likert-scale points higher). This could be a result of professionals having more clinical knowledge and understanding of engagement in people with advanced dementia than the general public, greater awareness of the challenges PWD experience in their capacity to respond verbally and physically, or altered perceptions of the significance of subtle behaviours.
Hypothesis 2, which concerned inter-rater reliability, was also supported. Across professionals, high to moderate ICCs were found in general, with some inconsistent results between raters. This inconsistency might reflect the experts working in different fields and settings. For example, those who showed higher agreement with each other were psychologists and clinicians working in community settings who have experience of running groups for PWD; the others work in more acute hospital settings and consequently have less experience of residential group interventions. As a result, professionals working on acute wards might have different understandings of what engagement looks like for PWD than those working in residential care. Indeed, a person's belief system, worldview and reality are often constructed from their experience (Koltko-Rivera, 2004). This was difficult to account for and capture in the VASE, and it raises questions about whether the current VASE is overly reductionist in capturing such a complex concept as engagement. On the other hand, high agreement was found among non-professionals, further supporting the reliability of the VASE among non-professional raters.
Hypothesis 3, that the VASE would be able to differentiate the level of engagement between MfL and PL, was supported. The ICC and non-parametric test results suggested that the VASE has good to excellent reliability in differentiating MfL from PL sessions. These results echo the findings from the MLM, where raters generally rated the control condition an average of 1.43 Likert-scale points lower. This result is perhaps unsurprising, as active MfL music activities are more dynamic and interactive than passive music listening, with participants tending to react and respond to each other and to the musician(s), who also make the music using different musical instruments. Hypothesis 4 assumed that the extraneous variables would not affect ratings; this hypothesis was rejected. The MLM findings suggest that ratings differed significantly across the five-second periods, indicating that the level of engagement in the video changes over time and that raters used the VASE in a dynamic fashion. Another promising result was that the order in which raters rated the segments (M1 versus M2 versus M3) had no effect on rating values. This is consistent with the between-group ICC result and suggests that the VASE rating scale is not affected by order effects.
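The non-parametric comparison used here, a Mann-Whitney U test between each rater's MfL and PL ratings, can be sketched as follows (our own textbook implementation of the U statistic with average ranks for ties, not the study's analysis code):

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples, averaging ranks over ties."""
    pooled = sorted(sample_a + sample_b)
    # Average 1-based rank for each distinct value in the pooled sample.
    avg_rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        avg_rank[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(avg_rank[v] for v in sample_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    return u_a, n_a * n_b - u_a

# Hypothetical per-rater VASE ratings: every MfL rating exceeds every PL rating,
# so U for the PL sample is 0 (maximal separation between conditions).
mfl = [5, 6, 6, 7, 5]
pl = [2, 3, 3, 4, 2]
print(mann_whitney_u(mfl, pl))  # (25.0, 0.0)
```

In practice the U statistic would then be referred to its null distribution (or a normal approximation) to obtain the per-rater p-value.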
The rater's age appeared to affect the rating score. During data collection, some raters in their 70s and above said they found the tablets difficult to use and a 25-minute video too long. Cornish and Dukette (2009) state that the average maximum attention span for adults is around 20 minutes; future research should therefore consider the optimal length of time for using the VASE. Apart from familiarity with technology and attention span, other factors associated with ageing include declines in processing speed (Eckert et al., 2010) and response selection time (Woods et al., 2015). To overcome this issue, it is worth considering using the reaction-time tests commonly included in computerised neurocognitive assessments to establish raters' baseline reaction times (Schatz et al., 2015), or revising the program to permit a longer processing duration or to allow pausing and rewinding.

Strengths and weaknesses
The present study developed a new rating scale and examined its validity, exploring the possibility of using "non-symptomatic" concepts such as engagement to understand PWD's responses to interventions. Engagement with others and involvement in activities are important for various dimensions of health and wellbeing for those with dementia (Benveniste et al., 2012; Robinson et al., 2013). A validated scale for the assessment of engagement will be useful to researchers and clinicians in better understanding the effect of interventions for PWD, including those who have difficulty expressing themselves verbally. The examination of the processes during an intervention is crucial in helping professional and family carers to learn about the participants' responses, and to gauge clinical benefits; the VASE is therefore a useful measure in that sense. When such a measure is user-friendly and publicly available, it will have wider applicability to all those who are involved in dementia care.
Secondly, the study made a unique contribution as the first known study to capture moment-by-moment changes in engagement during an intervention, enabling raters to make continuous ratings as they observe changes in the video. The VASE is also non-intrusive, and it does not require raters to be present because it uses previously recorded video. Videos can be reviewed and re-rated by the same viewer or different viewers, enabling multiple raters to cross-check their engagement scores. Most importantly, the VASE can record the exact time at which changes in engagement occur during a group intervention. This allows raters to identify which particular activities stimulate different levels of engagement for particular individuals, which could enable clinicians or carers to tailor specific activities for PWD in order to promote person-centred care. The scale might also allow family members who live at a distance to be involved in suggesting activities for their relative via a live link or pre-recorded video.
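Because the VASE records the time at which each change in engagement occurs, a facilitator could, in principle, query a rating stream for the moments of highest engagement. A minimal sketch of such a query (the data format and function name are hypothetical, not part of the VASE software):

```python
def high_engagement_spans(timestamped_ratings, threshold):
    """Return (start, end) second spans where the rating meets the threshold.

    `timestamped_ratings` is a list of (second, rating) pairs in time order;
    each span is half-open: engagement is at or above threshold from `start`
    up to, but not including, `end`.
    """
    spans, start = [], None
    for second, rating in timestamped_ratings:
        if rating >= threshold and start is None:
            start = second                      # span begins
        elif rating < threshold and start is not None:
            spans.append((start, second))       # span ends
            start = None
    if start is not None:                       # still engaged at end of clip
        spans.append((start, timestamped_ratings[-1][0] + 1))
    return spans

# Example stream: engagement rises during a sing-along section (seconds 2-4).
stream = [(0, 2), (1, 3), (2, 6), (3, 7), (4, 6), (5, 3)]
print(high_engagement_spans(stream, threshold=5))  # [(2, 5)]
```

The returned spans could then be matched against the session's activity log to see which activities coincided with high engagement.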
Moreover, the engagement scale was complemented by the decision to use a 360-degree camera to record group sessions. This permitted a simultaneous and much wider view in identifying engagement and interaction between participants in a group than single point of view recordings or direct observation by a single observer present at the session.
One limitation is that this was a feasibility study and the sample of raters was relatively small. Furthermore, as sampling was opportunistic, there is the possibility of sampling bias (Stasser & Titus, 1985), and people who participated in this experiment might have different attitudes towards, and understandings of, engagement from those who did not participate.
Criterion and construct validity were also not established. It was not possible to establish criterion validity as the VASE is a single-item continuous rating scale. However, further research could determine construct validity to enable better confidence that the VASE is operating theoretically as expected (Gaugler et al., 2013). Test-retest reliability was also not established, owing to funding and time constraints. Further, rating participants in two music group conditions (PL and MfL) does not ensure the results are generalisable to other settings.
Lastly, during the discussion on the use of VASE, some experts proposed that in order to be able to fully understand engagement in a session, it is necessary to be present in the group and be "immersed in that atmosphere", yet others disagreed with this. Some experts also commented that ratings of engagement might be subjective and based on the raters' understanding of, and familiarity with, the subject area they are rating. As a result, there might be factors that this study had not considered. As this is the first cycle in its development, further work is needed before we can be confident in its reliability.

Research implications
Future research needs to review the current version of VASE and investigate possible adjustments of the measure, such as trialling with ratings based on different types of engagement and reviewing the duration of the rated video segments. Further research should also investigate the difference between family members and professionals in their understanding of engagement of PWD living in care homes. The protocol could then be revised to consider these differences. With a larger sample size different validity and reliability tests could be used, such as criterion and construct validity. Test-retest reliability could also be considered with the same rater re-rating the video.
The VASE adopted a seven-point Likert rating; seven-point ratings have previously been recommended as a good multipoint scale in preference to a five-point scale (Lewis, 1993). However, the mixed-model results suggest that the mean differences (standardised estimates) for some predictors, such as session, were small; further examination of the sensitivity and specificity of the VASE is needed to better understand the statistical and clinical significance of outcomes.
In the MLM, the residual (unexplained) variance was significant. This suggests that there are other variables not yet considered, and that more variables need to be incorporated to build a better model (Heck & Thomas, 2000). Variables such as mood, cognitive ability and attitude towards the activity could be added. Raters' awareness of, and attitudes towards, dementia (Handley et al., 2017), as well as their age and baseline reaction time, could also be investigated.
Further research could examine the use of VASE in other interventions regarded as beneficial for PWD such as cognitive stimulation (Orrell et al., 2017), art therapy (Deshmukh et al., 2018) and other types of music interventions (Clare & Camic, 2019).

Dementia care implications
The mixed findings suggest that further refinement of the VASE is needed before it can be used in dementia care settings. Engagement as an outcome was not previously considered a worthwhile construct to measure under a dominant medical model of dementia care, but it is now deemed valuable within person-centred (Kitwood, 1997b; Sung & Chang, 2005) and relational (Greenwood et al., 2001) approaches. Unlike symptom-based tools that measure the success of an intervention, the VASE offers an alternative way of understanding dementia, investigating choice and interaction. With revision, the VASE could be adopted in residential care settings to help understand levels of engagement with various activities. This could enable care staff to assess whether particular activities are suitable for individuals. As the VASE provides a time stamp for noted changes in engagement, it could also assist group facilitators in identifying the particular activities or stimuli that support higher levels of engagement and those that support lower levels. This could enable facilitators to adjust their intervention based on the group's or the individual's preference. Facilitators could also take the measurements and results to other experts, and even to clinical supervision, in order to make changes that would benefit PWD.
As a video analysis measurement that is easy to use and user-friendly, the VASE could potentially be beneficial for inviting a wider support network to engage in the care of the individual. For example, it could enable family carers to view and evaluate activities in which PWD participate in a care home without needing to be present in the group, giving families the option of becoming more involved in care.
Lastly, if the measure is found to be valid, it could potentially be used for staff training, where examples of engagement and non-engagement can be identified and shared. Most importantly, the new observational tool could also enable us to gain a better understanding of the particular nuances and components of what makes an activity useful for this population, and potentially this could be applied to evaluations of other types of interventions and activities (e.g. museum object handling, approaches to self-care, family interactions). Dynamic rating scales beyond engagement could also be adopted, applying them to other concepts in dementia care where observation of dynamic changes is required.

Conclusion
The feasibility and validation study results indicate that the current version of the VASE has good reliability in some areas. It still needs further investigation and adjustments for it to be a valid and reliable tool in measuring engagement of people with dementia in a group setting within residential care. Balancing the wish to develop a user-friendly measurement tool and to capture a complex abstract concept such as engagement is challenging. It is encouraging that there is some evidence suggesting that the VASE is able to distinguish between the level of engagement of participants in two different types of music activities (passive and active). With further adjustments and investigation, the VASE could be a useful measure in advancing dementia care. Improvements in assessments of the processes during an intervention will facilitate a better capturing of the concept of engagement, and would eventually benefit carers in the promotion of wellbeing of those who are affected by dementia.

Underlying data
The raw data analysed in this research are based on a video analysis from a multi-session music group for people with advanced dementia. Family members who are legal guardians for the participants in this study provided written consent for their participation as did the residential care facility where data was collected. Due to the sensitive nature of video data showing full frontal identification we are not able to make the videos available in an open access format. Permission to do so was not given by the ethics review panels, as well as being inappropriate to do for a vulnerable group of research participants.
If researchers are interested in using the video data from this study, they are asked to contact the corresponding author (p.camic@ucl.ac.uk), describing the nature of their interest, the intended use of the data and plans for obtaining ethical approval at their respective institution, and to sign a confidentiality agreement to assure protection of the data whilst in their possession. When ethical approval has been granted, as evidenced by a signed letter from the ethics panel, data will be transferred via overnight carrier on an encrypted external hard drive at the requester's expense. Once received, the video data may not be downloaded onto other computers and can only be accessed from the external password-protected and encrypted hard drive, which must be returned to the corresponding author at the end of its use.

Software availability
Source code available from: https://github.com/PaulCamic/Video-Analysis-Scale-of-Engagement-VASE-for-people-with-advanceddementia/tree/VASE-source-code

Archived source code at time of publication: https://doi.org/10.5281/zenodo.4001025 (Lai et al., 2020c)

License: Creative Commons Zero v1.0 Universal