
Exploring data- and knowledge-driven methods for adaptive activity learning with dynamically available contexts

  • Jiahui Wen
  • Jadwiga Indulska
  • Mingyang Zhong
  • Xiaohui Cheng
  • Jingwei Ma
Regular Paper

Abstract

Various aspects of human activity recognition have been researched so far and a variety of methods have been used to address them. Most of this research assumed that the data sources used for the recognition task are static. In real environments, however, sensors can be added or can fail and be replaced by different types of sensors. It is therefore important to create an activity recognition model that is able to leverage dynamically available sensors. To approach this problem, we propose methods for activity learning and activity recognition adaptation in environments with dynamic sensor deployments. In our previous work, we proposed sensor and activity context models to address sensor heterogeneity, as well as a learning-to-rank method for activity learning and its adaptation based on the proposed context models. However, most of the existing solutions, including our previous work, require labelled data for training. To tackle this problem and further improve the recognition accuracy, in this paper we propose a knowledge-based method for activity recognition and activity model adaptation with dynamically available contexts in an unsupervised manner. We also propose a semi-supervised data selection method for activity model adaptation, so that the activity model can be adapted without labelled data. We use comprehensive datasets to demonstrate the effectiveness of the proposed methods, and show their advantage over conventional machine learning algorithms in terms of recognition accuracy.

Keywords

Activity recognition · Knowledge-driven · Data-driven

Introduction

Recognizing activities (Zhan et al. 2014) is of great importance in a variety of applications, including health-related applications such as dietary monitoring (Zhou et al. 2015), daily routine understanding (Sun et al. 2014), personal assistants (Lukowicz et al. 2015), and abnormal behaviour detection (Riboni et al. 2015). Nearly all of the traditional activity recognition approaches are based on static data sources, as the data sources are assumed to be unchanged in the training and testing phases. Activity recognition systems created with pre-defined data sources are not able to leverage dynamically available data sources to potentially improve the recognition accuracy. However, many researchers have demonstrated that additional information sources can significantly benefit activity recognition. For example, Zhan et al. (2014) show that incorporating vision features from an additional camera can improve the recognition accuracy in accelerometer-based activity recognition, particularly for static activities (e.g. sitting and writing) that are difficult to recognize based on inertial sensors alone. Riboni and Bettini (2011) use additional location context for activity recognition: as some activities can only be performed in specific semantic locations (e.g. tooth brushing in the bathroom), the location information can rule out implausible activities. Other research shows that additional information such as vital signs, objects (e.g. a cup) in a home setting, and audio features (Maekawa et al. 2010) can also improve activity recognition performance.

In addition, it has been shown in previous work that sensors used for long-term activity monitoring can often fail or need to be updated (Gonzalez and Amft 2015). It is therefore very important that the activity recognition system is able to evolve in order to adapt to the changing data sources it uses for activity recognition, and that it can take advantage of the evolving deployment of sensors. The difficulty is that training an activity model used by such an activity recognition system is not a trivial task. The reasons are twofold: (i) To automatically incorporate dynamically discovered sensors, it is necessary to deal with the heterogeneity of sensors and the data they produce, due to the variety of sensor types and sensor modalities. For example, data produced by the same type of sensor may need to be interpreted differently when used for different purposes. (ii) While domain knowledge can be used to specify the interrelations between activities and the contextual information provided by the discovered sensors, this domain knowledge cannot produce optimal results, as people perform activities in a variety of ways (Zhou et al. 2015; Wen et al. 2015b).

In our previous work (Wen et al. 2016), we proposed a high-level activity recognition framework that is able to integrate dynamically available sensors upon their discovery. Furthermore, we proposed a data-driven method for incorporating dynamically available contexts with the learning-to-rank method and temporal regularization.

In this paper, we extend our previous work and make the following contributions:
  • We propose a knowledge-based method for activity recognition and activity recognition model adaptation with dynamically available contexts. The interrelations between contexts and activities can be mined from an external knowledge base (e.g. websites), and we can use this knowledge for creating activity models and also updating activity models in an unsupervised manner when new context sources are available.

  • We propose a semi-supervised data selection method for activity recognition model adaptation, so that when new context sources become available, the activity models can be updated without labelled data. In our previous data-driven method (Wen et al. 2016), annotated data is required to adapt the activity recognition model to the new dataset (including information provided by the newly available context sources).

  • We validate the proposed data-driven method using one of the most complex human activity recognition datasets, OPPORTUNITY, and demonstrate the advantage of the proposed data-driven method over traditional personalized activity recognition methods. We validate the proposed knowledge-driven method using both the OPPORTUNITY dataset and a simulation dataset.

The remainder of this paper is organized as follows. In Sect. 2 we review the related work. In Sect. 3, we give a brief description of the modelling of sensors and activities that is based on our previous work (Wen et al. 2016), and also include the necessary concepts and definitions. The knowledge-driven and data-driven methods are described in Sects. 4 and 5, respectively, followed by their performance evaluation in Sect. 6. Finally, we conclude the paper in Sect. 7.

Related work

There are many methods for recognizing human activities in data-driven activity recognition. Supervised methods learn activity models with annotated data; however, labelling large amounts of activity data is time-consuming and expensive. Semi-supervised methods learn activity models with a small amount of labelled data, and retrain the models with unlabelled instances classified with high confidence. Generally, semi-supervised methods are proposed to deal with the problem of data scarcity. They can also be used to approach activity model adaptation, which adapts an activity model created with the data of a large number of users to a specific person, given his/her activity data. Examples include Stikic et al. (2009) and Reiss and Stricker (2013).

Unlike these methods, we consider activity recognition and activity model adaptation with dynamically available contexts. To achieve this, we need to discover sensor modalities and pre-process sensor data for the activity recognition task, and perform adaptive learning on the newly acquired context to obtain optimal recognition performance.

Many previous works leverage external knowledge to deal with previously unseen sensor data for recognizing activities. For instance, Gu et al. (2010) interrelate activities and contexts (e.g. object usage) with knowledge that is mined from publicly available web pages; the prior knowledge can thus be used for activity recognition without labelled data. Tapia et al. (2006) substitute the conditional probability of an unseen object given the activities with linearly combined conditional probabilities of existing similar objects, where the similarities are measured through WordNet. Wang et al. (2007) define the probabilities between activities and contextual information (e.g. objects, actions) with common sense human knowledge. In some other works, such as Van Kasteren et al. (2010), the authors leverage domain knowledge to transfer activity models from one domain to others, so that the contextual information unseen in one domain can be used for activity recognition in the others (e.g. different smart houses). The aforementioned methods generally utilize common sense or external knowledge for recognizing activities in an unsupervised manner, hence they are not customized for activity recognition and activity model adaptation with dynamically available contexts. In addition, as people perform activities quite differently (Wen et al. 2015b; Zhou et al. 2015), these methods cannot obtain accurate recognition given the activity data of different users. The difference between these works and ours lies in the fact that we can flexibly define the relations between activities and contexts using external knowledge. By associating each relation with a weight, and learning the weights from activity data selected without supervision, we can personalize the activity model to a specific user and achieve higher recognition accuracy.

Context modelling

Sensor context modelling

A variety of sensor types are used for activity recognition in previous work, such as body-worn sensors, object sensors, and ambient sensors (Roggen et al. 2010). Discovering new sensors and populating their sensor data into the activity recognition system can potentially benefit the recognition task, as adding new sensors provides more context information. However, the data of dynamically discovered sensors need to be preprocessed into the context information required by the recognition system. For example, for a binary sensor that is used for object interaction monitoring, the output value of the sensor directly indicates whether the user is interacting with the object or not. However, for inertial sensors (e.g. accelerometer, gyroscope) that are used for the same purpose, the continuous sensor values may need to be pre-processed into a proper feature vector to indicate the interaction. To approach the problem of sensor heterogeneity, we model the necessary information of the sensors so that we know how to process the sensor readings into proper representations when the sensors are dynamically discovered.

By sensor modelling we capture the necessary information about the sensors, such as the sensor type and the sensor reading type. From among the various context modelling methods, we adopt the fact-based approach from Henricksen and Indulska (2004) to model the sensors. This fact-based context modelling method was created to model both low-level and high-level context (e.g., various situation types including activities) but can also be used to model sensors. This was already shown by Hu et al. (2008), who used the fact-based context model to model sensors for autonomic mapping between the abstract context required by context-aware applications and the sensors providing raw sensor data. The goal was to replace one type of sensor by a different sensor that can produce the same abstract context. Our goal for sensor modelling is different, as we need to model the sensor metadata that informs how to pre-process the raw sensor readings into proper representations, so that they can be integrated in activity recognition and activity model adaptation.
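As a sketch of how such sensor metadata could be represented in software, the following hypothetical Python fragment captures the fact types discussed above. All field names and the variance-threshold pre-processing rule are illustrative assumptions, not part of the original framework:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical representation of the sensor metadata captured by the
# fact-based model (field names are illustrative, not from the paper).
@dataclass
class SensorContextModel:
    sensor_type: str    # e.g. "on-body", "ambient", "object"
    reading_type: str   # e.g. "binary", "continuous"
    location: str       # e.g. "kitchen"
    attached_to: str    # e.g. "cup"
    preprocess: Callable[[List[float]], int]  # raw readings -> abstract context (0/1)

# Example: an accelerometer attached to a cup. A simple variance threshold
# stands in for the clustering-based pre-processing mentioned in the text.
def accel_to_usage(readings: List[float]) -> int:
    mean = sum(readings) / len(readings)
    var = sum((r - mean) ** 2 for r in readings) / len(readings)
    return 1 if var > 0.05 else 0  # 1 = the "cup" context is observed

cup_sensor = SensorContextModel("object", "continuous", "kitchen", "cup", accel_to_usage)
print(cup_sensor.preprocess([0.1, 0.9, 0.2, 1.1]))  # high variance -> 1
```

Upon discovery, the metadata tells the recognition system that this sensor monitors object usage and how its continuous readings should be abstracted.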
Fig. 1

Sensor modelling

Fig. 2

Example of an activity model

Figure 1 shows an example of a sensor model that captures the necessary context information about a sensor: the sensor type (e.g. on-body sensor, ambient sensor), the sensor reading type (e.g. continuous, discrete), a model for pre-processing sensor readings into abstract context, the location (e.g. kitchen), and the object the sensor is attached to (e.g. cup). The IEEE 1451 standard describes standard sensor interfaces through which sensors can be queried upon discovery and can present information about themselves. Based on this query, a discovered sensor can be associated with its sensor context model. The information in this model provides the guideline for pre-processing the sensor readings into high-level context (e.g. interaction with objects) for activity recognition. For example, given a sensor that produces binary output, an ambient sensor (e.g. a motion sensor) may indicate the location context, while an object sensor implies object usage. The pre-processing of sensor readings into the abstract context requires a pre-processing model that is different for different kinds of sensors. The pre-processing of sensor readings used in our approach is described in Sect. 6.

Activity context modelling

Modelling sensors facilitates the pre-processing of readings of dynamically discovered sensors, while modelling of activities with abstract context is used for recognition of a particular human activity. The activity context models used for activity recognition can be viewed as patterns of the context types and values that define activities, and activity recognition is performed by matching these activity models with the context information pre-processed from the sensor readings. Another motivation for activity modelling is that we can leverage domain knowledge for activity recognition to remove the requirement of data labelling. The idea is that the contextual information used to describe high-level activities is human readable, and hence common sense can be used to correlate the contexts and activities as the starting point (Alam et al. 2015) (e.g. 'cooking' \(\Rightarrow\) 'kitchen' AND ('walking' OR 'standing')).

We present an example of activity modelling in Fig. 2. The activity is described by various context fact types, and each fact type is associated with a probability that specifies the likelihood of observing the given context in this particular activity. The probabilities can be obtained in many different ways, such as from domain knowledge (Van Kasteren et al. 2010; Wang et al. 2007) or external sources (Gu et al. 2010; Tapia et al. 2006). We further associate each probability with a weight, and the rationale for introducing the weight is twofold. First, as the probabilities between the activities and contexts represent general knowledge, we need to introduce the weights to obtain personalized activity models (i.e., to show how important a particular context is for the activity recognition of a particular person). Second, since classification margins between activity classes may change due to the additional contexts introduced by the dynamically available sensors, learning the weights with selected data can adjust and maximize the boundaries between activities in the presence of new sensor data. For example, by mining knowledge from websites (Gu et al. 2010), we know that the probability of sugar is almost the same in the activities Make tea and Make coffee. However, a specific individual may use sugar heavily in Make tea and infrequently in Make coffee. Therefore, the weight of sugar in Make tea needs to be increased for this person's activity recognition to indicate the important role it plays in the activity. In the later sections, based on the sensor and activity context models, we formulate a machine learning problem and propose methods to learn the weight matrix.

We perform activity model adaptation with dynamically available sensors by learning the weights in presence of new sensor data. In Fig. 2, we present such an adaptation in the grey part with dash lines. When a sensor is dynamically available, we pre-process the sensor readings into proper high-level representations (i.e., contexts) according to the sensor context model. The high-level context information is populated into the activity model and adaptation of activity recognition is performed with selected activity data. For example, when a sensor is dynamically discovered and queried, we analyse the corresponding sensor context model, and we learn that it is an accelerometer attached to a cup and it produces continuous readings. Therefore, it is a sensor used for object usage monitoring, and the sensor readings need to be pre-processed (e.g. by clustering) to indicate when the context cup is observed or not. The context provided by this sensor is populated into all the activities (e.g. Make tea, Make coffee) that relate to the context cup. The corresponding activity models are adapted by learning the weights of the context with selected activity data.

Problem definition

In this section, we introduce the concepts and definitions used in this paper and then formally define activity recognition as a classification problem.

Let \(L=\{(x_i,y_i)\}_{i=1,\ldots , |L|}\) denote the set of labelled activity instances, with \(x_i\) being the ith feature vector and \(y_i\in \{1,2,\ldots ,C\}\) being the corresponding activity class. C indicates the total number of activity classes. A feature vector \(x_i\) is the aggregation of contextual information using a sliding window, and it is formally defined as an \(N\)-dimensional binary vector \(x_i=\{x_i^1,\ldots ,x_i^N\}\) with \(x_i^j\in \{0,1\}\) indicating whether the jth context is observed in the sliding window or not (Van Kasteren et al. 2008). The problem is: how to recognize the set of testing instances \(x\in \{0,1\}^{1\times (N+d)}\), given the set of training data L, where d is the number of dynamically available contexts. In the later sections, we describe how to learn context weights for activities, followed by the activity recognition adaptation to the newly available contexts.
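The construction of the binary feature vector \(x_i\) from a sliding window can be illustrated with a short Python sketch; the context names and window contents below are made up for illustration:

```python
# Illustrative context vocabulary (N = 5); names are hypothetical.
CONTEXTS = ["kitchen", "cup", "coffee", "sugar", "kettle"]

def feature_vector(window_observations, contexts=CONTEXTS):
    """x_i^j = 1 iff the j-th context was observed anywhere in the window."""
    seen = set(window_observations)
    return [1 if c in seen else 0 for c in contexts]

# Contexts observed inside one sliding window:
x = feature_vector(["kitchen", "cup", "kitchen"])
print(x)  # [1, 1, 0, 0, 0]
```

When d contexts become dynamically available, the vocabulary simply grows by d entries and the vector becomes \((N+d)\)-dimensional.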

Let \(P\in {\mathbb {R}}^{C\times N}\) be the probability matrix with \(P_{kj}\) defining the probability of jth context given kth activity. Matrix P can be mined from the knowledge database (Perkowitz et al. 2004), learned from the labelled data (Gu et al. 2009) or even manually defined.

Knowledge-driven method

In this section, we describe how to leverage an external knowledge base to create activity models and perform activity model adaptation with dynamically available sensors in an unsupervised manner. High-level activities are usually characterized by different kinds of contexts (e.g. "making sandwich" can be described by the location context "kitchen", and the object contexts "knife" and "bread"). Moreover, there exist descriptive texts that specify instructions for how to perform high-level activities. Therefore, the contexts characterizing the activities can be extracted from these texts using natural language processing methods, and we can then calculate the parameters of the contexts with respect to different activities. Finally, with these parameters we are able to create activity models and perform activity model adaptation using the Bayesian framework. In this way, dynamically available sensors are incorporated into the activity recognition framework automatically. In what follows, we first describe the process of mining the probability matrix P from third-party sources, followed by the activity modelling, prediction and activity model adaptation based on the probability matrix.

Knowledge base

In this section, we describe how to mine the knowledge (i.e. context-activity conditional probabilities) from the websites www.wikihow.com and www.ehow.com (Perkowitz et al. 2004; Wyatt et al. 2005). Both of these websites describe how to perform daily activities and the contexts involved. The basic idea of this knowledge-driven method is that the probability of observing a context in an activity is related to the probability of the textual representation of the context appearing in the textual description of the activity. We first crawl the websites and obtain the descriptive documents for each target activity class, then identify the contexts involved in each activity using natural language processing methods. Finally, we calculate the context-activity conditional probability of each context with respect to different activities. The mining process can be described by the following steps:
Fig. 3

Search for activity description

  • Search the two aforementioned websites for the target activities. As illustrated in Fig. 3, the website lists multiple hyperlinks that redirect to webpages describing how to perform the activities step by step. We automatically crawl all the pages for each target activity. As we search for the target activities in the same website, the webpages that describe the activities have the same HTML schema. This makes it feasible to automatically crawl the textual descriptions.

  • Once we have the textual descriptions of the target activities, natural language processing methods are used to extract the interesting contexts from the text. The processing of the texts from the webpages goes through the following pipeline: tokenization, part-of-speech (POS) tagging, lowercasing, stemming, and WordNet filtering. We first tokenize the sentences in the texts into lists of single words so that they can be further processed by later phases. In the second step, we tag the tokenized words with part-of-speech tags. Since the contexts involved in the activities are nouns, we only select the words tagged with "NOUN" for further analysis. We then change capital letters into lowercase and reduce morphological variants of a word to their stemmed or root forms (e.g. standing-stand, bottles-bottle). The rationale behind these two steps is that variants of a word with the same meaning should have a unique representation in our case. Finally, since the contexts involved in the activities are objects or substances in the physical space, we use the knowledge base WordNet for filtering. In WordNet, each word has its hypernyms, and the relations between a word and its hypernyms follow the "is-a" relationship (e.g. coffee is-a [beverage, tree, seed, brown]). For each word, we walk through its hypernym paths, and the word is categorized as an object or a substance if the word "object" or "substance" resides in any of its hypernym paths. Figure 4 shows that "coffee" is classified as a substance, as multiple hypernym paths walk through "substance".

  • After the processing phases, we obtain thousands of contextual terms, some of which are not discriminative and not useful for the activity recognition task. In this step, we find the top-k most important contexts for each activity class. Specifically, we calculate the term frequency-inverse document frequency (tf-idf) of each context term with respect to the activity classes as a measurement of its discriminative power, and choose the contexts for each activity class based on this measurement.
    $$\begin{aligned} \mathrm{tf\text{-}idf}_{c,y}=\frac{n_{c,y}}{\sum _cn_{c,y}}\cdot \log \frac{|\{d\}|}{|\{d:c\in d\}|} \end{aligned}$$
    (1)
    where \(n_{c,y}\) is the number of occurrences of context c in activity class y, \(|\{d\}|\) is the total number of collected texts describing different activity classes, and \(|\{d:c\in d\}|\) is the number of texts in which context c appears. The first term \(\frac{n_{c,y}}{\sum _cn_{c,y}}\) denotes the frequency of the context in a specific activity class. If the context appears frequently in an activity class y, then this term is larger, meaning that the probability of observing the context in this activity class is higher. The second term \(\log \frac{|\{d\}|}{|\{d:c\in d\}|}\) is the inverse document frequency of the context c. It penalizes contexts that are universal and appear in almost all documents, as they provide little discriminative power.
  • Finally, we calculate the context-activity probability of those selected contexts with respect to different activity classes based on the processed descriptive texts. Specifically, we calculate the context-activity probability with the Naive Bayesian method. Let P(c|y) be the context-activity probability (i.e. probability of context c occurring in documents that describe activity y), let \(n_k(c)\) be the number of texts that describe activity class \(y=k\) in which context c is observed; and let \(N_k\) be the total number of texts of that activity class. Then we can estimate the parameters of the context likelihood as,
    $$\begin{aligned} P(c|y=k) = \frac{n_k(c)}{N_k} \end{aligned}$$
    (2)
    the relative frequency of documents of activity class \(y=k\) that contain context c. In practice, we use a small hyperparameter \(\alpha\) for smoothing.
    $$\begin{aligned} P(c|y=k)=\frac{n_k(c)+\alpha }{N_k+|\{c\}|\alpha } \end{aligned}$$
    (3)
    where \(|\{c\}|\) is the total number of contexts. Table 1 shows examples of some activity classes and the related contexts with high probabilities.
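Equations (1)-(3) can be sketched in a few lines of Python. The toy corpus below stands in for the crawled wikiHow/eHow texts; all activity names, context nouns, and counts are illustrative:

```python
import math
from collections import Counter

# Toy corpus: each activity class maps to its documents, each a list of
# extracted context nouns. All names and counts are illustrative.
docs = {
    "make_coffee": [["coffee", "water", "cup"], ["coffee", "sugar", "cup"]],
    "make_tea":    [["tea", "water", "cup"],   ["tea", "leaf", "water"]],
}

def tf_idf(context, activity):
    """Eq. (1): term frequency within the activity times inverse document frequency."""
    counts = Counter(c for d in docs[activity] for c in d)
    tf = counts[context] / sum(counts.values())
    n_docs = sum(len(ds) for ds in docs.values())
    n_with_c = sum(1 for ds in docs.values() for d in ds if context in d)
    return tf * math.log(n_docs / n_with_c)

def p_context_given_activity(context, activity, alpha=0.01):
    """Eq. (3): smoothed fraction of the activity's documents containing the context."""
    all_contexts = {c for ds in docs.values() for d in ds for c in d}
    n_k_c = sum(1 for d in docs[activity] if context in d)
    return (n_k_c + alpha) / (len(docs[activity]) + len(all_contexts) * alpha)

print(round(tf_idf("coffee", "make_coffee"), 3))  # discriminative context, high score
print(round(tf_idf("water", "make_coffee"), 3))   # common context, lower score
print(round(p_context_given_activity("coffee", "make_coffee"), 2))
```

On this toy corpus, "coffee" scores higher than "water" for Make coffee because "water" appears in documents of both classes and is discounted by the idf term.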
Fig. 4

Example of hypernyms path

Table 1

Examples of context-activity conditional probability

| 1 Make coffee | 2 Make tea | 3 Make pasta | 4 Make oatmeal |
| Coffee 0.93 | Tea 0.89 | Pasta 0.85 | Bowl 0.69 |
| Water 0.69 | Water 0.87 | Water 0.66 | Mix 0.62 |
| Cup 0.68 | Cup 0.69 | Salt 0.61 | Oatmeal 0.55 |
| Sugar 0.45 | Sugar 0.50 | Oil 0.58 | Sugar 0.49 |
| Pot 0.36 | Leaf 0.43 | Sauce 0.58 | Oat 0.48 |

Activity modelling

We use the Bayesian framework to formulate the activity model, and enforce a Markov smoother on neighbouring activity instances to encourage the same activity to be continued, so as to avoid accidental misclassifications. Therefore, we model the joint distribution of the observed activity feature vector sequence x and the latent activity sequence y.
$$\begin{aligned} P(\mathbf x ,\mathbf y )=p(y_1)p(x_1|y_1)\prod _{i=2}^I p(y_i|y_{i-1})p(x_i|y_i) \end{aligned}$$
(4)
By assuming independence among the different contexts, we have:
$$\begin{aligned} p(x_i|y_i) = \prod _{n=1}^N p(x_{i,n}|y_i) \end{aligned}$$
(5)
where N is the total number of contexts that are currently available. In practice, we found that Bernoulli Naive Bayes performs better than alternatives such as Gaussian and Multinomial Naive Bayes. Therefore, the decision rule is:
$$\begin{aligned} p(x_{i,n}|y_i) = p(c_n|y_i)x_{i,n}+(1-p(c_n|y_i))(1-x_{i,n}) \end{aligned}$$
(6)
where \(x_{i,n}\) is a binary value, indicating the presence of the nth context in the ith instance as described in Sect.  3.3, and \(p(c_n|y_i)\) is the conditional probability of nth context given activity class \(y_i\), as described in Sect.  4.1. If context \(c_n\) is present, then \(x_{i,n}=1\) and the required probability is \(p(c_n|y_i)\). Otherwise, the required probability is \(1-p(c_n|y_i)\). Therefore, Bernoulli Naive Bayes also considers the non-occurrences of the contexts.
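A minimal sketch of the Bernoulli emission probability in Eq. (6); the three context probabilities below are illustrative values in the spirit of Table 1:

```python
# Bernoulli emission probability of Eq. (6): contexts absent from the window
# (x_n = 0) also contribute, via the factor 1 - p(c_n|y).
def emission(x, p_given_y):
    """p(x_i | y_i) = prod_n [ p(c_n|y) * x_n + (1 - p(c_n|y)) * (1 - x_n) ]."""
    prob = 1.0
    for x_n, p_n in zip(x, p_given_y):
        prob *= p_n * x_n + (1.0 - p_n) * (1.0 - x_n)
    return prob

p_make_coffee = [0.93, 0.69, 0.68]  # p(coffee|y), p(water|y), p(cup|y), illustrative
x = [1, 1, 0]                        # coffee and water observed, cup not
print(emission(x, p_make_coffee))    # 0.93 * 0.69 * (1 - 0.68)
```

Incorporating d newly discovered contexts, as in Eq. (7), amounts to appending their knowledge-base probabilities and observations to `p_make_coffee` and `x`.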
In Sect. 4.1, we described how to leverage the external source to create the knowledge base that specifies those conditional probabilities, so that when we dynamically discover new contexts we can use those probabilities in the knowledge base for activity recognition. Suppose that there are d contexts dynamically available, then the emission probability needs to be updated to incorporate the new contexts with the probabilities from the knowledge base:
$$\begin{aligned} p(x_i|y_i)=\prod _{n=1}^Np(x_{i,n}|y_i) \prod _{n=N+1}^{N+d}p(x_{i,n}|y_i) \end{aligned}$$
(7)
Activity prediction is equivalent to finding the latent activity sequence that is able to maximize the joint distribution, and this can be solved with Viterbi dynamic programming.

Activity prediction

Now that we have the emission probabilities (i.e. context-activity probabilities) from the knowledge base, we still need the transition probabilities among the activity classes so that we can infer the latent activity sequence from the sequence of context observations. We manually set the transition probabilities with domain knowledge, similarly to previous work (Wu et al. 2007; Wang et al. 2007). The basic idea is that a human usually carries out an activity for a certain amount of time, so the current activity is more likely to be continued in the next time slice. Therefore, the self-transition probabilities are much higher than the probabilities of transitioning from one activity class to a different one. We experimentally set the self-transition probability to 0.9 for each activity class, as we showed in Wen et al. (2015a) that this setting achieves sufficiently high accuracy:
$$\begin{aligned} p(y_i|y_{i-1}) = {\left\{ \begin{array}{ll} 0.9 &{}y_i=y_{i-1}\\ \frac{1-0.9}{C} &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(8)
where C is the number of activity classes as specified in Sect.  3.3.
Given a sequence of context observations \(x=\{x_1,x_2,\ldots ,x_m\}\), the latent activity classes can be estimated by finding the corresponding latent activities of those observations, so as to maximize the joint distribution \(p(x_1,x_2,\ldots ,x_m,y_1,y_2,\ldots ,y_m)\). To solve the problem, we define the forwarding variable,
$$\begin{aligned} \alpha _j(i)= & {} \underset{y_1,y_2,\ldots ,y_{i-1}}{max}p(x_1,x_2,\ldots ,x_i,y_1,y_2,\ldots ,y_i=j) \end{aligned}$$
(9)
$$\begin{aligned}&s.t.\quad 1\leqq i\leqq m \end{aligned}$$
(10)
$$\begin{aligned}&j\in \{1,2,\ldots ,C\} \end{aligned}$$
(11)
to be the highest probability of the ith observation being activity j, maximized over the previous \(i-1\) latent activity classes. Maximizing the joint distribution is equivalent to solving \(\underset{j}{max}\,\alpha _j(m)\). By deriving the iterative relationship between the forwarding variables:
$$\begin{aligned} \alpha _j(1)= & {} p(y_1=j)p(x_1|y_1=j) \end{aligned}$$
(12)
$$\begin{aligned} \alpha _j(i+1)= & {} (\underset{k}{max} \alpha _k(i)p(y_{i+1}=j|y_i=k))p(x_{i+1}|y_{i+1}=j) \end{aligned}$$
(13)
in each iteration, we choose the activity class that maximizes the forwarding variable as the prediction:
$$\begin{aligned} y_{i+1} = \underset{j}{argmax}\ \alpha _j(i+1) \end{aligned}$$
(14)

Data-driven method

The previous section describes how to mine knowledge from external sources for activity modelling and prediction. One of the advantages of this method is that training data is not required for parameter learning, and the activity model can be created and adapted with dynamically available contexts without supervision. However, activity learning and adaptation based on common sense general knowledge is usually not able to achieve high accuracy, due to the fact that people perform activities differently. Therefore, the activity model needs to be personalized to a specific user to achieve high recognition performance. The personalization process takes the activity data of a specific user as input, and employs a data-driven machine learning method for learning the parameters of the activity models. The basic idea is to find the parameters that minimize the empirical error on the training data, which is also expected to minimize the testing error under the assumption that the user's activities remain consistent during a short period of time.

To do this, we extend the activity model in Fig. 2 so that each context-activity conditional probability is further associated with a weight, as shown in Fig. 2. As discussed earlier, the rationale for introducing the weight is twofold. First, the weight personalizes the activity model to obtain optimal recognition accuracy (i.e. it shows how important a particular context is for the activity recognition of a particular person). Second, since margins between activity classes may change due to the additional context information provided by newly discovered sensors, learning the weights from the context data provides activity recognition adaptation by adjusting the classification margins. Recall the earlier example: although the mined probabilities of sugar are almost the same in Make tea and Make coffee, for a person who uses sugar heavily in Make tea and infrequently in Make coffee, the weight of sugar in Make tea needs to be increased. In the later sections, based on the sensor and activity context models, we describe how to perform activity recognition, learning, and activity recognition model adaptation.

Activity recognition and learning

The basic idea of activity recognition is that for each feature vector, we calculate a score against each of the activity classes using the parameters, and choose the class with the maximum score as the prediction:
$$\begin{aligned} prediction = argmax_{y}W_{y}\cdot (logP_{y}\circ x_i)^T \end{aligned}$$
(15)
where \(P_y\) are the conditional probabilities of all the contexts on class y, and \(W_y\) are the weights associated with \(P_y\).
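The scoring rule of Eq. (15) can be sketched in a few lines of NumPy; the toy values for \(P_y\) and \(W_y\) below are hypothetical, not taken from the paper's models:

```python
import numpy as np

def predict(x, log_p, w):
    """Score a binary feature vector against every activity class and
    return the index of the highest-scoring class (Eq. 15).

    x      : (N,)   binary context vector
    log_p  : (C, N) log conditional probabilities log P_y per class
    w      : (C, N) per-class context weights W_y
    """
    scores = (w * (log_p * x)).sum(axis=1)  # W_y · (log P_y ∘ x)^T
    return int(np.argmax(scores))

# Hypothetical toy model: 2 classes, 3 contexts
log_p = np.log(np.array([[0.9, 0.8, 0.1],
                         [0.2, 0.7, 0.9]]))
w = np.ones((2, 3))
x = np.array([0, 1, 1])      # contexts 2 and 3 are observed
print(predict(x, log_p, w))  # -> 1
```

Only the contexts present in `x` contribute to a class's score, which is why absent contexts drop out of the product \(logP_y\circ x\).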
The weight matrix W needs to be learned from the data, and we draw on the idea of learning-to-rank to formulate the learning problem. Since we choose the activity class with the maximum score as the prediction, for each feature vector the score of the ground-truth class should be ranked higher than the scores of the other classes:
$$\begin{aligned} \begin{aligned} r(y_i,x_i)&>r(y,x_i)\\ s.t.\ (x_i,y_i)&\in L,y\in \{1,\ldots ,C\},y\ne y_i\\ r(y_i,x_i)&=W_{y_i}\cdot (logP_{y_i}\circ x_i)^T+b_{y_i}\\ r(y,x_i)&=W_{y}\cdot (logP_{y}\circ x_i)^T+b_y \end{aligned} \end{aligned}$$
(16)
where \(b_y\) is a displacement variable introduced for the case in which an activity class is barely described by any context. Solving the above inequalities is equivalent to maximizing the Area Under the ROC Curve (AUC), which is commonly used in classification problems. Generally, the larger the AUC, the more often the correct activity class is ranked higher than the others.
$$\begin{aligned} \begin{aligned}&max\,\, \sum _{(x_i,y_i)\in L}\sum _{y\ne y_i}log(\sigma (r(y_i,x_i)-r(y,x_i)))\\&-\frac{\beta _1}{2}\sum _{(x_i,y_i)\in L}\sum _{j\in N(i)}(r(y_i,x_i)-r(y_i,x_j))^2\\&-\frac{\beta _2}{2}\sum _{i=1}^C(||W_i||^2+||b_i||^2) \end{aligned} \end{aligned}$$
(17)
where \(\sigma (x)\) is the sigmoid function: \(\sigma (x)=\frac{1}{1+e^{-x}}\), and \(\sum _{(x_i,y_i)\in L}\sum _{j\in N(i)}(r(y_i,x_i)-r(y_i,x_j))^2\) is the temporal regularization that encourages neighbouring feature vectors to have the same predicted activity classes.
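For illustration, a single stochastic-gradient ascent step on the pairwise ranking term of Eq. (17) might look as follows; the temporal regularization term is omitted for brevity and all parameter values are hypothetical:

```python
import numpy as np

def sgd_step(x, y_true, log_p, w, b, lr=0.1, beta2=0.01):
    """One SGD ascent step on the pairwise ranking term of Eq. (17)
    for a single labelled instance, plus L2 shrinkage; the temporal
    regularization term is left out for brevity."""
    feat = log_p * x                # log P_y ∘ x, per class (C, N)
    r = (w * feat).sum(axis=1) + b  # ranking scores r(y, x)
    for y in range(len(b)):
        if y == y_true:
            continue
        # gradient of log σ(r_true - r_y) w.r.t. the score difference
        g = 1.0 - 1.0 / (1.0 + np.exp(-(r[y_true] - r[y])))
        w[y_true] += lr * g * feat[y_true]
        b[y_true] += lr * g
        w[y] -= lr * g * feat[y]
        b[y] -= lr * g
    # L2 term -β₂/2 (‖W‖² + ‖b‖²) contributes gradients -β₂W and -β₂b
    w -= lr * beta2 * w
    b -= lr * beta2 * b
    return w, b
```

Each step pushes the ground-truth class's score up and the competing classes' scores down, widening the ranking margin that the AUC-style objective rewards.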

Activity model adaptation

We perform learning to rank to obtain more accurate activity models by weighting each context. Let \(Q=\{(x_j,y_j)\}_{j=1,\ldots ,|Q|}\) be the selected adaptation dataset, where \((x_j,y_j)\) is the jth instance in the adaptation dataset and contains d dynamically available contexts: \(x_j=\{x_j^1,\ldots ,x_j^N,\ldots ,x_j^{N+d}\}\) with \(x_j^{k}\in \{0,1\}\). The method for selecting the adaptation dataset is described later in this section. With the adaptation data, the objective function in Eq. (17) can be reformulated as follows,
$$\begin{aligned} \begin{aligned} max&\sum _{(x_i,y_i)\in L}\sum _{y\ne y_i}log(\sigma (\hat{r}(y_i,x_i)-\hat{r}(y,x_i))) \\&+\sum _{(x_j,y_j)\in Q}\sum _{y\ne y_j}log(\sigma (r(y_j,x_j)-r(y,x_j)))\\&-\frac{\beta _1}{2}\sum _{(x_i,y_i)\in L\cup Q}\sum _{j\in N(i)}(r(y_i,x_i)-r(y_i,x_j))^2\\&-\frac{\beta _2}{2}\sum _{i=1}^C(||W_i||^2+||b_i||^2) \end{aligned} \end{aligned}$$
(18)
where
$$\begin{aligned} \hat{r}(y_i,x_i)=W_{y_i,[0:N]}\cdot (logP_{y_i,[0:N]}\circ x_i)^T+b_{y_i}\ ,\ (x_i,y_i)\in L \end{aligned}$$
(19)
The objective function can also be maximized with SGD (Bottou 2010). During the learning process, the parameters (i.e. W and b) are iteratively adjusted to discriminate one activity from another; a given activity class allocates a large weight to a context that is important for the activity and a small weight to a less discriminative context. In this light, the activity model adaptation process is able to automatically determine the useful contexts. Notice that even though we only query a small set of the activity instances for retraining, we can still leverage the unlabelled instances for temporal regularization (the third term in Eq. (18)).

Instance selection

In this subsection, we introduce the method for selecting instances for classifier retraining and adaptation. The instances contain dynamically discovered contexts, and the proposed activity recognition model is able to automatically incorporate a new context if it is discriminative enough. In this way, the proposed model can self-adapt and self-refine. We perform instance selection after belief propagation in order to select informative and profitable instances that allow the classifier to converge quickly without human intervention.

Belief propagation

As new sensors that provide new contexts are dynamically discovered, we need to select instances that contain the new sensor data to adapt the proposed activity recognition model. The aim of this stage is to leverage belief propagation to smooth out outliers and rectify the results produced by the proposed model, so as to select the most profitable and informative instances to learn the new contexts and adapt the activity model. Due to the temporal characteristics of human behaviour, the current activity is likely to continue into the next time slot. Therefore, there are strong correlations among the sequential predictions for the instances.
Fig. 5

Belief propagation between hidden variables

Belief propagation is mainly used for inference in graphical models, and takes the form of message passing between nodes. The messages passed among the nodes exert the influence of one variable on the others. In this light, belief propagation sends messages to a connected node to tell it what it should believe (Yedidia et al. 2005), and the hidden state of a node depends not only on local observations, but also on the product of all incoming messages from locally connected nodes. Upon convergence, the marginal distribution of the variable nodes can be approximated with:
$$\begin{aligned} p(y_k|\varvec{X}) = \frac{\phi _f(y_k)\prod _{f'\in N(k)\setminus f}\mu _{f'\rightarrow k}(y_k)}{\sum _{y_k'}\phi _f(y_k')\prod _{f'\in N(k)\setminus f}\mu _{f'\rightarrow k}(y_k')} \end{aligned}$$
(20)
where \(\phi _f(y_k)\) is the local evidence, and \(\mu _{f'\rightarrow k}(y_k)\) is the message from neighbouring factor nodes of node k, as shown in Fig. 5.

In our scenario, belief propagation is performed among the observation nodes and the hidden nodes. The observation node at time t is the feature vector collected from the sensor data, while the hidden node is the latent activity. Since the latent activity is unknown, the latent variable \(y_k\) is represented as a multinomial distribution over all the activities. The multinomial distribution is iteratively updated by incorporating messages not only from local observations, but also from adjacent nodes.

In our framework, we only consider pairwise connections (Fig. 6) between the hidden nodes when performing belief propagation. Therefore, the messages that a node receives are the posterior probabilities of its neighbouring nodes based on their own local observations, as shown in (21)
Fig. 6

Belief propagation in our scenario. The solid lines show the messages received by node k from its four neighbouring nodes

$$\begin{aligned} p(y_k|\varvec{X}) = \frac{p(y_k|x_k)\prod _{i\in N(k)\setminus i:y_i=y_k}p(y_i|x_i)}{\sum _{y_k'}p(y_k'|x_k)\prod _{i\in N(k)\setminus i:y_i=y_k'}p(y_i|x_i)} \end{aligned}$$
(21)
Belief propagation is thus performed as an inference step followed by several iterative update steps. In the inference step, for each observation, the proposed model generates a posterior probability distribution over the hidden activities. In the propagation step, these initial estimates of the posterior probabilities are propagated to neighbouring nodes. Each recipient node k then combines the received probability distributions over \(y_i\) with its local evidence given by the proposed model and converts them into a distribution over \(y_k\) using Eq. (21). The iterative process can be repeated until convergence. In our experiments, we found that running belief propagation for only one iteration is sufficient for the posterior distributions to converge.

We slightly modify belief propagation in our implementation. As instances classified with high confidence usually tend to be correctly classified, we do not update the posterior distributions of those high-confidence instances during the iterative process of belief propagation, so that their beliefs can be propagated to the uncertain instances.
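The one-pass, pairwise version of belief propagation described above, including the rule that high-confidence posteriors are frozen, can be sketched as follows (the posterior values and the 0.9 confidence threshold are illustrative assumptions):

```python
import numpy as np

def propagate(post, conf_thresh=0.9):
    """One pairwise belief-propagation pass (Eq. 21) over a chain of
    per-instance posteriors `post` of shape (T, C). Instances whose
    confidence exceeds `conf_thresh` are not updated, so their beliefs
    propagate to uncertain neighbours."""
    new = post.copy()
    for k in range(len(post)):
        if post[k].max() >= conf_thresh:
            continue                 # keep high-confidence beliefs fixed
        msg = post[k].copy()         # local evidence p(y_k | x_k)
        for i in (k - 1, k + 1):     # pairwise temporal neighbours
            if 0 <= i < len(post):
                msg *= post[i]       # incoming message p(y_i | x_i)
        new[k] = msg / msg.sum()     # normalization of Eq. (21)
    return new

# An uncertain instance surrounded by confident predictions is smoothed out:
post = np.array([[0.95, 0.05],
                 [0.40, 0.60],
                 [0.95, 0.05]])
print(propagate(post)[1])
```

In this example the middle instance, initially leaning toward class 1, is pulled toward class 0 by its two confident neighbours, illustrating the outlier smoothing used for instance selection.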

Measurements

First of all, we introduce the measurements that evaluate the profitability of an instance, so that based on these quantitative criteria, instances can be selected to adapt the model. The first metric we consider is the “drift” in the posterior distribution before and after belief propagation. Belief propagation is able to smooth out outliers by exploiting temporal information. Instances that experience a large “drift” in their posterior distributions are much more valuable, since they are not modelled by the initial activity model and have a greater chance of residing near the classification boundaries. The Jensen-Shannon divergence can be used to measure the “drift”, as it has proved effective for measuring the distance between two distributions in previous work (Sun et al. 2014). Suppose \(p_i\) and \(q_i\) are the posterior distributions of instance i before and after belief propagation, respectively; then the JS-divergence is:
$$\begin{aligned} JS(p_i,q_i)=\frac{1}{2}D_{KL}(p_i||m)+\frac{1}{2}D_{KL}(q_i||m) \end{aligned}$$
(22)
where \(m=\frac{1}{2}(p_i+q_i)\) and \(D_{KL}(p_i||m)=\sum _jp_{ij}log\frac{p_{ij}}{m_j}\) is the Kullback-Leibler divergence between two distributions. Therefore, we derive the first measurement as:
$$\begin{aligned} score_{i1} = \frac{JS(p_i,q_i)-JS_{min}(p,q)}{JS_{max}(p,q)-JS_{min}(p,q)} \end{aligned}$$
(23)
We normalize the JS-divergence so that the measurement based on the posterior distribution “drift” always lies in [0, 1]; in this way it caters for the characteristics of different activity datasets.
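A minimal sketch of the drift measurement of Eqs. (22) and (23):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two posteriors (Eq. 22)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))  # Kullback-Leibler divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drift_scores(before, after):
    """Min-max normalized posterior 'drift' per instance (Eq. 23)."""
    js = np.array([js_divergence(p, q) for p, q in zip(before, after)])
    return (js - js.min()) / (js.max() - js.min())

# Illustrative posteriors before and after belief propagation:
before = np.array([[0.5, 0.5], [0.9, 0.1]])
after = np.array([[0.9, 0.1], [0.9, 0.1]])
print(drift_scores(before, after))  # the first instance drifted most
```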
As the second measurement of profitability, we consider the number of consecutive neighbouring instances that have the same predicted results.
$$\begin{aligned} \begin{aligned} N_i&= min(N_i^{forward}, N_i^{backward}) \\ score_{i2}&= \frac{N_i-min(N)}{max(N)-min(N)} \end{aligned} \end{aligned}$$
(24)
where \(N_i^{forward}\) and \(N_i^{backward}\) are the numbers of consecutive neighbouring observations that have the same predictions along the two directions of the time series, starting from the current observation i. It is normalized for the same reason as \(score_{i1}\). This measurement shows the extent to which the neighbouring nodes agree in their predictions; the higher the number, the more likely the prediction is correct. Obviously, \(score_{i2}\) is based on the temporal characteristics of human behaviour. In the extreme case, when the observation happens to be in the middle of an ongoing activity, \(score_{i2}\) tends to be large and the prediction can be trusted with more confidence.
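A sketch of Eq. (24), assuming a simple list of per-instance predicted labels:

```python
import numpy as np

def run_scores(preds):
    """Normalized consensus score from consecutive identical predictions
    (Eq. 24): for each instance, count matching predictions forward and
    backward, take the minimum, then min-max normalize."""
    T = len(preds)
    n = np.zeros(T, dtype=int)
    for i in range(T):
        fwd = 0
        while i + fwd + 1 < T and preds[i + fwd + 1] == preds[i]:
            fwd += 1
        bwd = 0
        while i - bwd - 1 >= 0 and preds[i - bwd - 1] == preds[i]:
            bwd += 1
        n[i] = min(fwd, bwd)
    return (n - n.min()) / max(n.max() - n.min(), 1)
```

Instances deep inside a run of identical predictions score highest, matching the intuition that the middle of an ongoing activity is predicted most reliably.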

Finally, we consider the confidence of the instances after belief propagation. The posterior distribution itself provides information about the confidence of an instance. Adding the instances with the highest confidence is equivalent to locating the class centres, which in turn also helps to adapt the model to some extent, even though those instances are less informative. Therefore, the third measurement is formulated as \(score_{i3}=max(p(y_i|x_i))\) (Eq. (21)).

To decide which instances are more profitable, we take all the aforementioned metrics into account. Therefore, we determine the final profitability score of an instance from the corresponding scores of each metric. The combined score is defined as follows:
$$\begin{aligned} \begin{aligned} score_i = \alpha _1 score_{i1}&+\alpha _2 score_{i2} + \alpha _3 score_{i3} \\ s.t. \sum _{i=1}^3\alpha _i&= 1 \end{aligned} \end{aligned}$$
(25)
where the weights \(\alpha _i\) are set manually. In our method, we distribute the importance evenly over the three metrics by setting \(\alpha _1=\alpha _2=\alpha _3\). However, with different weights, the model may exhibit different characteristics. For example, by increasing \(\alpha _3\) we give more weight to the high-confidence instances; the model then adapts conservatively and convergence is quite slow. By contrast, when we put more weight on \(score_{i1}\), the model only takes instances whose posterior distributions change dramatically before and after belief propagation, and the adaptation is performed aggressively. There is then a danger that noisy data may be added and the model jeopardised.
However, different activity classes may have different distributions over the scores (e.g. Fig. 7). Therefore, to maintain class balance, we set different thresholds for different classes, so that for each class all instances with a score higher than that class's threshold are selected as adaptation data. The threshold can be set to a certain percentile (e.g. the 30th or 50th) of the distribution over the scores for each class, so that class balance is guaranteed.
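The combined score of Eq. (25) and the per-class percentile thresholding can be sketched as follows (the equal weights and the 30th-percentile default follow the text; the function name is ours):

```python
import numpy as np

def select_adaptation_set(scores, labels, percentile=30,
                          alphas=(1/3, 1/3, 1/3)):
    """Combine the three metrics with weights α (Eq. 25) and, per class,
    select the instances scoring above that class's percentile threshold
    so that class balance is maintained. `scores` has shape (T, 3)."""
    combined = scores @ np.asarray(alphas)
    selected = np.zeros(len(combined), dtype=bool)
    for c in np.unique(labels):
        mask = labels == c
        thresh = np.percentile(combined[mask], percentile)
        selected |= mask & (combined > thresh)
    return selected
```

Because each class's threshold is computed on its own score distribution, roughly the same fraction of instances is selected per class regardless of how the score distributions differ.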
Fig. 7

Distribution over the scores of different activities

Experiment

Public dataset

We validate the proposed methods using the OPPORTUNITY dataset (Roggen et al. 2010). The dataset contains activity data from 4 subjects performing Activities of Daily Living (ADLs) in a home setting. In total, 72 sensors, including 21 ambient sensors and 14 object sensors, are deployed to monitor the activities at a sampling rate of 30 Hz. The activities of the user in the scenario are annotated on different levels, including locomotion (e.g. standing), gestures (e.g. opening) and high-level activities (e.g. Coffee time). For object and ambient sensors, even though there are several sensors of the same type, they are used for monitoring different contexts. Therefore, there is a 1-to-1 correspondence between contexts and sensor readings for the object and ambient sensors. The other wearable sensors, used to recognize low-level locomotion (e.g. standing), are treated as a group: their readings are used to recognize the low-level locomotive contexts, and the recognition results serve as inputs for high-level activity recognition (e.g. Coffee time). The reason is that wearable sensors produce different readings when they are fixed in different body positions with different orientations. As a result, the sensor readings they produce do not have semantic meanings (in contrast to low-level activities), so it is impossible to apply domain knowledge to these readings.

Each of the four subjects (Subjects 1, 2, 3 and 4, denoted S1, S2, S3 and S4) performs the ADLs in 5 runs. In each run, the subjects are instructed to perform the activities according to a high-level script and are encouraged to perform the activities in their usual way with all the variations they are used to. To demonstrate the effectiveness of the proposed method for recognizing high-level complex activities, we use data from 3 subjects rather than 4. We do not include the data of the 4th subject because rotational noise has been artificially added, so the sensor data does not represent activity data captured in realistic scenarios (Cao et al. 2012). We use a sliding window of 5 seconds with 50% overlap to segment the streaming data; the data description is presented in Table 2. The window length of 5 seconds is a tradeoff between delay and recognition performance, and examining the influence of the window length is out of the scope of this paper.
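The segmentation described above (5-second windows at 30 Hz with 50% overlap) corresponds to the following index computation; the function name is illustrative:

```python
def sliding_windows(samples, rate_hz=30, win_s=5, overlap=0.5):
    """Segment a stream of sensor samples into fixed-length windows
    (5 s with 50% overlap, as used for OPPORTUNITY). Returns (start, end)
    index pairs into the sample sequence."""
    win = int(win_s * rate_hz)          # 150 samples per window
    step = int(win * (1 - overlap))     # advance 75 samples per window
    return [(s, s + win) for s in range(0, len(samples) - win + 1, step)]

# 15 s of data at 30 Hz yields 5 half-overlapping windows
print(sliding_windows(list(range(450))))
```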
Table 2

Dataset description

Datasets

Activities (instances)

S1

Cleanup (283), Coffee_time (376), Early_morning (283), Relaxing (100), Sandwich_time (576), null (380)

S2

Cleanup (274), Coffee_time (273), Early_morning (216), Relaxing (120), Sandwich_time (749), null (290)

S3

Cleanup (205), Coffee_time (279), Early_morning (369), Relaxing (167), Sandwich_time (507), null (229)

Simulation dataset

Table 3

Simulation activity classes

1

Make coffee

7

Brush teeth

13

Clean table

2

Make tea

8

Wash clothes

14

Play PC games

3

Make pasta

9

Make orange juice

15

Watch TV

4

Make oatmeal

10

Watch DVD

16

Put on make-up

5

Fry eggs

11

Take pills

17

Use toilet

6

Make phone call

12

Read books

  

We also generate sensor data for the validation of the knowledge-driven method, covering the commonly performed daily activities listed in Table 3. The generation of the sensor data is based on the context-activity probability matrix P. We assume a 1-to-1 correspondence between sensors and contexts, as in previous works (Wu et al. 2007; Wang et al. 2007; Perkowitz et al. 2004; Maekawa and Watanabe 2011; Gu et al. 2010). Algorithm 1 describes the generation process.

The prior distribution of the activity classes is proportional to the number of descriptive texts that we can crawl from the websites for each activity class. In Algorithm 1, we first generate the class label \(Data_{i,-1}\) for the ith instance, and then generate the context presences for that instance. Specifically, for each context j, the presence of that context in activity class \(Data_{i,-1}\) is drawn from the Bernoulli distribution \(Bernoulli(P_{Data_{i,-1},j})\). We use the Bernoulli distribution for generating context presences because previous work (Meng et al. 2009) has demonstrated that real sensor events follow a Bernoulli distribution parameterized by the firing probability. Due to the assumed 1-to-1 correspondence between sensors and contexts, drawing the context presences is equivalent to generating the sensor firings (i.e. sensor events).
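A minimal sketch of the generation process of Algorithm 1, assuming the context-activity matrix P and the class prior are given:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_instances(P, prior, n):
    """Sketch of Algorithm 1: draw a class label from the prior, then draw
    each context's presence from Bernoulli(P[class, context]).
    P is the (C, N) context-activity probability matrix."""
    labels = rng.choice(len(prior), size=n, p=prior)
    # A uniform draw below P[class, context] fires the corresponding sensor
    X = (rng.random((n, P.shape[1])) < P[labels]).astype(int)
    return X, labels
```

Under the 1-to-1 sensor/context assumption, each row of `X` is both the context-presence vector and the simulated sensor-event vector for one instance.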

Validation of knowledge-driven method

In this subsection, we describe the validation of the knowledge-driven method for activity recognition and adaptation. Specifically, we demonstrate the possibility of incorporating dynamically discovered contexts for activity recognition. We first introduce the validation method, followed by an illustrative example, and finally the experimental results.

Validation method

For the OPPORTUNITY dataset, we do not recognize gesture contexts, as they are highly correlated with object contexts (Wang et al. 2007) and are difficult to recognize based solely on the wearable sensors (Cao et al. 2012). Therefore, each instance consists of 27 features, including locomotion, object and ambient contexts. For the binary sensors, the produced values are used directly as features (i.e., pre-processing of sensor readings into contexts is not needed). For object and ambient sensors that produce continuous values requiring pre-processing to obtain context information usable for activity recognition, we use K-means to cluster the standard deviation of the sensor values into 2 components, indicating whether the sensors are triggered or not. Therefore, even though these sensors produce continuous readings, they are treated as binary sensors in a logical sense. These different preprocessing methods are a result of sensor heterogeneity and motivate the sensor modelling.
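The K-means binarization step can be sketched with a simple 1-D two-means routine (a stand-in for a library K-means; the initialization and iteration count are our assumptions):

```python
import numpy as np

def binarize_stddev(stds, iters=10):
    """Cluster per-window standard deviations into 2 components (triggered
    vs. not triggered), a minimal 1-D 2-means stand-in for the K-means step."""
    c = np.array([stds.min(), stds.max()], dtype=float)  # initial centroids
    for _ in range(iters):
        # assign each value to its nearest centroid, then recompute centroids
        assign = np.abs(stds[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (assign == k).any():
                c[k] = stds[assign == k].mean()
    return assign  # 1 = high-variance cluster ("sensor triggered")
```

Windows whose readings barely vary fall into the low-variance cluster and are treated as "sensor not triggered", giving the logical binary feature used above.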

For the simulation dataset, we use tf-idf to select the 10 most significant contexts for each activity class, as described in Sect. 4.1; this results in 149 contexts for all the activities (some activities share common contexts).

To validate the feasibility of incorporating new contexts for activity recognition, for the OPPORTUNITY dataset, we create two activity models. The first activity model contains X contexts (X is a parameter varied in our experiments), while the second model contains the dynamically available contexts in addition to the original X contexts. The X contexts are randomly sampled and this process is repeated 50 times to avoid biases. The average recognition performance of these two activity models is compared in this experiment, and the second model is expected to perform better as it contains additional contexts. The same validation method is applied to the simulation data, except that we generate separate datasets for the two activity models. Notice that activity modelling and prediction are introduced in Sects. 4.2 and 4.3, respectively.

Example

The following example illustrates the process of incorporating dynamically available contexts for activity recognition and adaptation. Suppose we have the activity classes make coffee and make tea, characterized by the contexts cup, water, and sugar with different probabilities, as shown in Fig. 8.
Fig. 8

Original activity model of make coffee and make tea

Fig. 9

An example of feature vector with context water, cup and sugar

Fig. 10

Activity models with additional context leaf

Fig. 11

An example of feature vector with additional context leaf present

Fig. 12

An example of feature vector with additional context leaf not present

For the feature vector (shown in Fig. 9) in which all the currently available contexts cup, water and sugar are present, the activity is always recognized as make tea, since make tea has the highest posterior probability calculated with the parameters and the feature vector. As make coffee is also characterized by those three contexts with similar probabilities, misclassification occurs when the user is actually carrying out the activity make coffee.

Suppose now we dynamically discover a sensor that provides the context leaf, which is used to characterize the activity make tea; the activity model can then be adapted with the parameters mined from the websites, as shown in Fig. 10. The context leaf provides additional information that is discriminative enough to differentiate make coffee from make tea. For example, a feature vector in which the context leaf is present is classified as make tea (shown in Fig. 11), while a feature vector in which the context leaf is not present is classified as make coffee (shown in Fig. 12).
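The example can be reproduced with a small Bernoulli Naive Bayes model; the probabilities below are hypothetical stand-ins for the values shown in Figs. 8 and 10:

```python
import numpy as np

# Hypothetical context probabilities (contexts: cup, water, sugar)
classes = ["make_coffee", "make_tea"]
P = np.array([[0.9, 0.9, 0.7],    # make coffee
              [0.9, 0.9, 0.8]])   # make tea

def classify(x, P):
    # Bernoulli Naive Bayes score: a present context contributes log P,
    # an absent one log(1 - P)
    s = (x * np.log(P) + (1 - x) * np.log(1 - P)).sum(axis=1)
    return classes[int(np.argmax(s))]

print(classify(np.array([1, 1, 1]), P))             # -> make_tea

# A newly discovered sensor adds the context `leaf`; adapting the model
# appends its mined probability for each class
P_adapted = np.hstack([P, [[0.05], [0.90]]])
print(classify(np.array([1, 1, 1, 1]), P_adapted))  # leaf present -> make_tea
print(classify(np.array([1, 1, 1, 0]), P_adapted))  # leaf absent -> make_coffee
```

With the three original contexts alone the two classes are nearly indistinguishable; after adaptation, both the presence and the absence of leaf are informative, which is why this sketch uses the full Bernoulli form with a \(log(1-P)\) term for absent contexts.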

Experiment results

The experiment results are presented in Fig. 13. Notice that for the simulation data, we create one dataset for the original activity model that only contains a certain percentage of the contexts, and a second dataset for the adapted activity model that contains new contexts in addition to the original ones. The percentage of contexts in the first dataset is varied from 10% to 90% (shown in Fig. 13a–i). The percentage of dynamically available contexts is also varied, from 10% until all the remaining contexts are added. As for the OPPORTUNITY dataset, the number of contexts in the original activity model is set to 24, so the number of dynamically available contexts is 3.
Fig. 13

Experiment results on simulation data and OPPORTUNITY

From the figures, we can see that we are able to improve the recognition performance by incorporating new contexts dynamically, using the knowledge from the websites, without supervision. The accuracy improvement is proportional to the number of dynamically available contexts, as more contexts provide more information for the activity models. We are also able to achieve high accuracy (97%) using all contexts, as shown in the figures, and we believe the reasons are twofold. First, the accuracy depends on the types of activity classes to be classified. As most of the activities listed in the previous table are characterized by distinctive contexts, their distinct activity patterns make them easy to recognize. Second, in the experiment we used the context-activity probability matrix to generate the simulation data and then used the same probabilities to create the Bernoulli Naive Bayes activity model. Therefore, the dataset contains a certain amount of bias.

We can also observe that the experiments on OPPORTUNITY show only a small recognition performance improvement (1%\(\sim\)3%). This is because OPPORTUNITY is a realistic activity dataset that contains many activity pattern variants. Therefore, it is difficult to recognize the activities; the extreme example is the first subject, for whom only a 1% f1-score improvement is achieved. In addition, we only dynamically incorporate 3 contexts, some of which may not be discriminative enough to improve the recognition performance significantly. To validate this assumption, we vary the parameter X from 24 to 15, and present the results in Fig. 14. From the figures we can observe that, with lower X, we obtain a larger recognition performance gain, as decreasing X means that more contexts are incorporated dynamically. However, the recognition performance of the activity model suffers as we lower X. This is expected, as less discriminative information is available for the original activity model. The marginal f1-score improvement on the OPPORTUNITY dataset demonstrates that general knowledge cannot be personalized to a specific user to achieve high accuracy, and this inspires us to use a data-driven method so that the activity model can be adapted to a specific user with his/her own activity data.
Fig. 14

Vary X from 24 to 15 in experiments with OPPORTUNITY

Validation of data-driven method

Validation method

In this section, we use the OPPORTUNITY dataset to validate the data-driven method described in Sect. 5. The preprocessing of the dataset is the same as that described in Sect. 6.3.1. Notice that we do not demonstrate the data-driven method with the simulation data. The reason is that we aim to demonstrate that the data-driven method is able to personalize and adapt the activity model to a specific user to achieve high accuracy; however, the simulation data is generated using general knowledge, so it does not serve this purpose.

We perform leave-one-(run)-out cross validation (LORO-CV) on each dataset. Specifically, one run of the data is left out for validation, and 50% of the remaining runs are used as the initial training data to create the initial activity model. Classification is performed on the other 50% of the remaining runs, and a certain percentage of those data is selected as the adaptation dataset. Notice that in Sect. 5.3.2 we introduced an adaptation data selection method that computes a score for each classified instance and selects the instances scoring higher than a threshold for the adaptation. The threshold is set to a certain percentile (e.g. the 30th percentile) of the distribution over the scores of each class. Finally, the model is validated on the left-out run. The rationale for choosing LORO-CV rather than the commonly used 10-fold-CV is threefold. First, the temporal information preserved in LORO-CV can be used for regularization in both the training and testing processes. Second, the testing process in LORO-CV classifies the testing instances sequentially, which is more similar to real-time activity recognition. Finally, in 10-fold-CV, the data in the training and testing sets are correlated to some extent, as the data stream is segmented with a 50%-overlap sliding window. Therefore, 10-fold-CV does not reflect the real performance of the classifier (Hammerla and Plötz 2015).

To emulate the impact of incorporating new sensors (and the contexts that can be derived from their readings), we first use a subset of the original OPPORTUNITY dataset, and the remaining portion of the dataset is used to emulate sensor readings from newly discovered sensors. In other words, we perform leave-n-(contexts)-out cross validation, where the instances in the initial training data contain information about (\(27-n\)) contexts, while the instances in the selected adaptation dataset contain information about all 27 contexts. This cross validation process can be seen as follows: we recognize the activities with X (i.e. \(27-n\)) contexts, and then the same activities are recognized with potentially better accuracy with \(X+Y\) (i.e. 27) contexts, where the Y (i.e. n) new contexts are provided by newly incorporated sensors. This kind of cross validation is commonly used in zero-shot learning (Cheng et al. 2013). The description of the cross validation is presented in Table 4.
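The leave-n-(contexts)-out setup can be emulated by masking n randomly chosen context columns in the initial training data; the helper below is illustrative:

```python
import numpy as np

def leave_n_contexts_out(X, n, rng=np.random.default_rng(0)):
    """Sketch of the leave-n-(contexts)-out setup: zero out n randomly
    chosen context columns in the initial training data to emulate
    sensors that are discovered only later. Returns the masked copy and
    the indices of the hidden contexts."""
    hidden = rng.choice(X.shape[1], size=n, replace=False)
    X_initial = X.copy()
    X_initial[:, hidden] = 0  # these contexts are "not yet discovered"
    return X_initial, hidden
```

The adaptation dataset keeps all 27 columns, so comparing models trained on `X_initial` and on the full `X` reflects the X versus X+Y scenario described above.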
Table 4

Cross validation description

Dataset

Composition

Description

Initial training set

50% of the remaining runs

The 50% of the data is randomly sampled; each instance contains information about X contexts, where X is varied from 24 to 15

Adaptation dataset

Classified instances scored higher than the threshold (e.g. the 30th percentile of the distribution over the scores of each class); each instance contains information about all 27 contexts

 

Validation dataset

One run

 

All the recognition results are presented in the form of f1-score (f1-score = \(\frac{\text {2*precision*recall}}{\text {precision+recall}}\)).

Impact of adaptation

In this experiment, we study the f1-score gain after incorporating dynamically available contexts. The threshold for selecting the adaptation data is set to the 30th percentile of the scores of the classified instances, and the number of contexts in the initial dataset, X, is varied from 24 to 15. For a given X, we randomly sample X contexts and repeat this process 200 times to avoid biases. As a result, each point in Fig. 15 represents a round of training, adaptation and validation. The x-coordinate of each data point is the f1-score before the model adaptation, and the y-coordinate is the f1-score after the adaptation. Therefore, any data point above the line \(f(x)=x\) indicates an f1-score improvement after the adaptation in that round of experiments; a larger distance from the line means a greater improvement.
Fig. 15

F1-score before and after adaptation across the datasets

Figure 15 shows that incorporating dynamically available contexts to adapt the activity model increases the recognition performance. Generally, the f1-scores for the setting \(X=24\) are more stable, while the f1-scores for settings with smaller X are more scattered. The underlying reason is that the f1-score improvement depends on the discriminative power of the dynamically available contexts: integrating more contextual sources dynamically provides more diversified discriminative information, and results in a more diversified f1-score gain. Basically, the more contexts are incorporated, the higher the expected improvement in recognition performance. To validate this, Fig. 16 presents the CDF (cumulative distribution function) of the f1-score gain across the datasets. It can be seen from the figure that incorporating more contexts for adaptation generally yields a larger f1-score gain than incorporating fewer contexts. Take S3 for example: the probability of achieving an f1-score gain of more than 20% is 60% if we incorporate 12 contexts (\(X=15\)). By contrast, the probability is 40% and 20% if we incorporate 9 (\(X=18\)) and 6 contexts (\(X=21\)), respectively.
Fig. 16

CDF of the f1-score across the datasets

Impact of adaptation data

Fig. 17

The impact of the amount of adaptation data on the recognition performance (measured with f1-score)

In this subsection, we examine the impact of the size of the adaptation dataset on the recognition performance. Figure 17 shows the impact of the amount of adaptation data on the f1-score across the datasets in different scenarios (i.e., numbers of left-out contexts). The x-axis represents the number of contexts in the initial training set and the y-axis stands for the f1-score after adaptation. For each \(X\in \{15,18,21,24\}\), the threshold for selecting the adaptation data is varied from the 30th to the 90th percentile of the scores of the classified instances. A higher threshold means less adaptation data for retraining.

From the figures, we can draw the following conclusions. Firstly, activity models with fewer contexts initially are more sensitive to the amount of adaptation data, as shown by the standard deviation of the results. According to Vapnik's theory (Vapnik 2013), the testing error of a classifier is upper-bounded by the training error plus a term that is proportional to the complexity of the classifier and inversely proportional to the number of training samples. Incorporating more contexts means that the activity models need to estimate more parameters, which increases their complexity. Therefore, the amount of adaptation data becomes critical to the testing error, and increasing the size of the adaptation set lowers the testing error dramatically.

Secondly, activity models with more initial contexts perform better than those with fewer. This is because activity models trained with an initial train set that contains more contexts yield higher recognition performance; they predict the instances with higher accuracy and are therefore more likely to select correctly predicted instances for adaptation.

Finally, there is no significant difference in the recognition performance when we vary the threshold from the 30th to the 70th percentile. However, the f1-scores drop sharply when we set the threshold to the 90th percentile. The reason is that a high threshold leaves insufficient adaptation data, and training the parameters with insufficient data results in overfitting and suboptimal activity models.
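The percentile-based selection discussed above can be sketched as follows; the function names and score values are illustrative, not the paper's implementation.

```python
# Sketch of the semi-supervised adaptation-data selection: keep only the
# classified instances whose confidence score is at or above a chosen
# percentile of all scores. Scores and instances are illustrative.
def select_adaptation_data(instances, scores, percentile):
    """Keep instances whose score is at or above the given percentile (0-100)."""
    ranked = sorted(scores)
    # score value at the percentile cut-off in the sorted list
    cut = ranked[min(len(ranked) - 1, int(len(ranked) * percentile / 100))]
    return [x for x, s in zip(instances, scores) if s >= cut]

scores = [0.2, 0.9, 0.5, 0.7, 0.4, 0.8, 0.3, 0.6, 0.1, 0.95]
instances = list(range(10))

kept_30 = select_adaptation_data(instances, scores, 30)  # larger subset
kept_90 = select_adaptation_data(instances, scores, 90)  # very small subset
```

A 30th-percentile threshold retains most classified instances, while a 90th-percentile threshold retains only a handful, which is consistent with the overfitting observed at the high-threshold setting.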

Influence of regularization weight

In this subsection we study the impact of the temporal regularization. The temporal regularization weight \(\beta _1\) in Eq. (18) controls the tradeoff between the local contextual information and the information from neighbouring instances. It is involved in both the learning and prediction processes. Figure 18 illustrates the f1-score as a function of the temporal regularization weight, which is varied from 0 to 0.4. As shown in Table 5, the threshold for selecting the adaptation data is the 30th percentile of the scores, and the number of initial contexts is set to 24. We do not present the results of the other settings (i.e. \(X\in \{15,18,21\}\), threshold = 50th, 70th, 90th percentile) here as they show a similar trend. From the figure we can see that by putting more weight on the pairwise evidence from neighbouring instances, we are able to smooth out accidentally mis-classified instances and improve the overall f1-score. Beyond a certain value (e.g. 0.2), the recognition performance becomes stable. This also shows that the regularization weight is easy to set: any value larger than 0.2 obtains a near-optimal result.
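The smoothing effect of the pairwise term can be illustrated with a simplified stand-in for Eq. (18): each instance's per-class scores are blended with its neighbours', weighted by \(\beta_1\). The scores below are made up for illustration.

```python
# Simplified stand-in for the temporal regularization of Eq. (18):
# blend each instance's per-class scores with its neighbours', weighted
# by beta1, then predict the argmax class. Scores are illustrative.
def smooth(scores, beta1):
    """scores: list of per-class score lists; returns smoothed class predictions."""
    smoothed = []
    for i, s in enumerate(scores):
        neighbours = [scores[j] for j in (i - 1, i + 1) if 0 <= j < len(scores)]
        blended = [
            (1 - beta1) * v + beta1 * sum(n[k] for n in neighbours) / len(neighbours)
            for k, v in enumerate(s)
        ]
        smoothed.append(max(range(len(blended)), key=blended.__getitem__))
    return smoothed

# class 0 everywhere, except an isolated flip to class 1 at index 2
seq = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.85, 0.15]]
no_reg = smooth(seq, 0.0)   # outlier survives without regularization
with_reg = smooth(seq, 0.3) # outlier is smoothed back to class 0
```

With \(\beta_1=0\) the isolated mis-classification survives; with a moderate weight the neighbouring evidence overrides it, mirroring the behaviour reported for Fig. 18.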
Fig. 18

The f1-score as a function of the temporal regularization weight

Table 5

Parameter setting description

| Parameter | Value |
| --- | --- |
| Threshold for selecting the adaptation data | 30th percentile of the scores of the classified instances |
| Number of contexts in the initial train set | X = 24 |

Comparison with conventional methods

In this subsection, we compare the proposed method with conventional machine learning methods. All the other settings (e.g. training set, validation set, method of cross-validation) are the same, except that the learning process is performed on the input feature vector \(x_i\) described in Sect. 3.3. During the adaptation process, the instances in the initial training set \(x_i=\{x_i^1,\ldots ,x_i^N\}\) are extended to \(x_i=\{x_i^1,\ldots ,x_i^N,0^{N+1},\ldots ,0^{N+d}\}\) to guarantee that a common classifier is trained.
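The zero-extension of the initial training instances can be sketched directly; the feature values below are illustrative.

```python
# Sketch of the feature-vector extension described above: an instance from
# the initial training set (N features) is padded with d zeros so that it
# shares a common feature space with instances carrying the d new contexts.
def pad_instance(x, d):
    """Append d zero entries standing in for the not-yet-observed contexts."""
    return x + [0.0] * d

old = [0.3, 0.7, 0.1]          # N = 3 original context features
padded = pad_instance(old, 2)  # extended to N + d = 5 dimensions
```

After padding, old and new instances have identical dimensionality, so a single classifier can be trained over both.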

The baselines introduced for comparison are SVM (support vector machine), RF (random forest) and LR (logistic regression). The parameters for these classifiers are obtained through grid-search cross-validation, as shown in Table 6. The threshold for selecting the adaptation data is set to the 30th percentile, and the comparison results across the datasets are presented in Fig. 19.
Table 6

Parameters for baselines

| Classifier | Parameters |
| --- | --- |
| SVM | kernel = RBF, \(\gamma = 0.01\), \(C=10e4\) |
| RF | \(n\_estimators = 100\) |
| LR | \(C=10e4\) |
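The grid search used to set the baseline parameters can be sketched generically; the `evaluate` stub below stands in for a real cross-validated f1-score, and the grid values merely mirror the style of Table 6.

```python
# Sketch of grid-search model selection: exhaustively score every parameter
# combination and keep the best. evaluate() is a stand-in for the real
# cross-validated f1-score; the toy scorer and grid values are illustrative.
from itertools import product

def grid_search(grid, evaluate):
    """Return the parameter combination with the highest validation score."""
    keys = sorted(grid)
    best, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

svm_grid = {"gamma": [0.001, 0.01, 0.1], "C": [1e2, 1e3, 1e4]}

# toy scoring function peaking at gamma=0.01, C=1e4 (illustrative only)
toy = lambda p: -abs(p["gamma"] - 0.01) - abs(p["C"] - 1e4) / 1e5
best, _ = grid_search(svm_grid, toy)
```

In practice the same loop is wrapped around k-fold cross-validation of each candidate classifier.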

Figure 19 shows that the proposed method outperforms all the baselines by a significant margin. We also vary the number of contexts in the initial dataset from 24 down to 15 for each dataset. On average, our method is 16.5% (max: 17.4%, min: 14.4%), 17.2% (max: 21.0%, min: 12.7%) and 16.1% (max: 20.8%, min: 8.34%) higher than the second-best baseline on datasets S1, S2 and S3, respectively, in terms of f1-score. These experiments demonstrate the advantage of our method in performing adaptive activity learning, and we believe the underlying reasons are twofold. Firstly, embedding the temporal regularization into the learning and prediction processes enables the proposed method to effectively leverage the temporal characteristics of human activities to obtain the desired predictive outcomes. Secondly, the weighted model in our method learns a weight for each context probability; it thus encodes domain knowledge into the activity model and is, to some extent, equivalent to a feature transformation.
Fig. 19

Comparison with the conventional machine learning methods across the datasets (measured with f1-score)

Comparison with hybrid classifiers

Table 7

Comparison with hybrid classifiers (f1-scores; X = number of contexts in the initial train set)

| Classifiers | S1, X=15 | S1, X=18 | S1, X=21 | S1, X=24 | S2, X=15 | S2, X=18 | S2, X=21 | S2, X=24 | S3, X=15 | S3, X=18 | S3, X=21 | S3, X=24 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVM + HMM | 0.62 | 0.662 | 0.697 | 0.735 | 0.635 | 0.685 | 0.759 | 0.814 | 0.675 | 0.728 | 0.775 | 0.858 |
| RF + HMM | 0.621 | 0.658 | 0.689 | 0.722 | 0.637 | 0.686 | 0.752 | 0.803 | 0.68 | 0.725 | 0.765 | 0.846 |
| LR + HMM | 0.631 | 0.674 | 0.71 | 0.752 | 0.631 | 0.683 | 0.751 | 0.805 | 0.672 | 0.723 | 0.769 | 0.85 |
| SVM + CRF | 0.626 | 0.668 | 0.703 | 0.742 | 0.64 | 0.691 | 0.764 | 0.82 | 0.685 | 0.737 | 0.783 | 0.866 |
| RF + CRF | 0.627 | 0.665 | 0.697 | 0.732 | 0.64 | 0.691 | 0.76 | 0.812 | 0.692 | 0.737 | 0.778 | 0.857 |
| LR + CRF | 0.633 | 0.675 | 0.711 | 0.753 | 0.634 | 0.688 | 0.757 | 0.812 | 0.679 | 0.731 | 0.778 | 0.86 |
| Ours | 0.796 | 0.839 | 0.866 | 0.877 | 0.815 | 0.892 | 0.923 | 0.928 | 0.884 | 0.911 | 0.923 | 0.93 |

Comparing our method only with conventional machine learning methods may seem unfair, as they predict each instance independently and do not consider the temporal information of neighbouring instances. Previous work (Cao et al. 2012; Wen et al. 2015a) proposes combining typical classifiers with graphical models (e.g. HMM, CRF) to smooth out outliers. As those graphical models make assumptions about the dependencies among the latent activity classes of the instances, so that the classification of one instance also considers the predictive outcomes of neighbouring ones, we also compare our method against these hybrid classifiers.

The threshold for selecting the adaptation data is the 30th percentile of the scores, and the number of initial contexts is varied from 24 down to 15. Table 7 presents the comparison results. From the table we can see that our method still outperforms the best baseline in terms of f1-score. Specifically, when compared with the best HMM-hybrid classifier, the average f1-score advantage is 15.29% (S1), 16.55% (S2) and 15.22% (S3), respectively. The average advantage is 15.16% (S1), 16.04% (S2) and 14.28% (S3) when compared with the best CRF-hybrid classifier. As confirmed in previous work (Van Kasteren et al. 2008), CRF performs slightly better than HMM. Notice that even though we only query a limited amount of labelled data in the adaptation phase, we are still able to leverage the neighbouring unlabelled instances for temporal regularization [Eq. (18)].
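The averages above can be checked against Table 7's S1 columns; the snippet below is a consistency sketch, and the averaging scheme (best HMM hybrid per X, then mean over X) is our reading of the comparison.

```python
# Consistency check on Table 7 (S1): average f1-score advantage of our
# method over the best HMM-hybrid baseline at each X in {15, 18, 21, 24}.
ours_s1 = [0.796, 0.839, 0.866, 0.877]
hmm_hybrids_s1 = {
    "SVM+HMM": [0.62, 0.662, 0.697, 0.735],
    "RF+HMM":  [0.621, 0.658, 0.689, 0.722],
    "LR+HMM":  [0.631, 0.674, 0.71, 0.752],
}

# best HMM-hybrid score at each X value
best_hmm = [max(vals[i] for vals in hmm_hybrids_s1.values()) for i in range(4)]
avg_advantage = sum(o - b for o, b in zip(ours_s1, best_hmm)) / 4
# ~0.153, in line with the ~15.29% advantage reported for S1
```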

Conclusion

In this paper, we addressed the problem of adaptive high-level activity recognition with dynamically available sensors (i.e. dynamically available additional contexts). Existing research shows that additional contextual information can potentially improve the recognition accuracy, and sensor addition or replacement is very common in activity recognition systems. It is therefore important that an activity recognition framework is able to evolve to use dynamically available contexts.

As sensors are heterogeneous, we propose to model the context of sensors and activities, so that when new sensors are discovered the raw sensor data can be properly pre-processed for the recognition task using those models. Based on these models, we propose knowledge-driven and data-driven methods for activity modelling and activity model adaptation with new contexts. In the knowledge-driven method, we mine external resources (e.g. websites) to specify the parameters of the activity models; in the data-driven method, we use the predicted instances to learn the parameters of new contexts in the activity models. With the knowledge-driven method, we can perform activity model adaptation with new contexts in an unsupervised manner. However, the knowledge mined from websites is general across users and cannot achieve high accuracy, because people perform activities differently. In contrast, the data-driven method can personalize the activity model to a specific user with his/her own data.

In the data-driven method, we propose a learning-to-rank approach for activity learning and adaptation. We also propose to select the most informative data for activity model adaptation without supervision. One advantage of the learning-to-rank approach is that various regularization terms can be added to exploit the characteristics of the data; in our work, we add temporal regularization to the learning and testing phases to capture the consistency of human behaviours.

Our experiments based on public and simulated datasets show that we are able to improve the recognition performance by adapting the activity model with dynamically available contexts. The improvements vary and depend on several factors, such as the amount of adaptation data, the weight of the temporal regularization and the number of contexts in the initial train set. To validate the advantages of the proposed method in adaptive learning, we compared it with conventional machine learning algorithms, and the experiments demonstrate that our methods for activity learning and activity model adaptation outperform the baselines by a large margin.

References

  1. Alam, M.A.U., Pathak, N., Roy, N.: Mobeacon: an iBeacon-assisted smartphone-based real time activity recognition framework. In: Proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 130–139. ICST (2015)
  2. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
  3. Cao, H., Nguyen, M.N., Phua, C., Krishnaswamy, S., Li, X.-L.: An integrated framework for human activity classification. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 331–340. ACM (2012)
  4. Cheng, H.-T., Griss, M., Davis, P., Li, J., You, D.: Towards zero-shot learning for human activity recognition using semantic attribute sequence model. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 355–358. ACM (2013)
  5. Gonzalez, L.I.L., Amft, O.: Mining relations and physical grouping of building-embedded sensors and actuators. In: 2015 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 2–10. IEEE (2015)
  6. Gu, T., Chen, S., Tao, X., Lu, J.: An unsupervised approach to activity recognition and segmentation based on object-use fingerprints. Data Knowl. Eng. 69(6), 533–544 (2010)
  7. Gu, T., Wu, Z., Tao, X., Pung, H.K., Lu, J.: epSICAR: an emerging patterns based approach to sequential, interleaved and concurrent activity recognition. In: IEEE International Conference on Pervasive Computing and Communications (PerCom 2009), pp. 1–9. IEEE (2009)
  8. Hammerla, N.Y., Plötz, T.: Let’s (not) stick together: pairwise similarity biases cross-validation in activity recognition. In: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 1041–1051. ACM (2015)
  9. Henricksen, K., Indulska, J.: A software engineering framework for context-aware pervasive computing. In: Proceedings of the Second IEEE Annual Conference on Pervasive Computing and Communications (PerCom 2004), pp. 77–86. IEEE (2004)
  10. Hu, P., Indulska, J., Robinson, R.: An autonomic context management system for pervasive computing. In: Sixth Annual IEEE International Conference on Pervasive Computing and Communications (PerCom 2008), pp. 213–223. IEEE (2008)
  11. Lukowicz, P., Poxrucker, A., Weppner, J., Bischke, B., Kuhn, J., Hirth, M.: Glass-physics: using Google Glass to support high school physics experiments. In: Proceedings of the 2015 ACM International Symposium on Wearable Computers, pp. 151–154. ACM (2015)
  12. Maekawa, T., Watanabe, S.: Unsupervised activity recognition with user’s physical characteristics data. In: 2011 15th Annual International Symposium on Wearable Computers (ISWC), pp. 89–96. IEEE (2011)
  13. Maekawa, T., Yanagisawa, Y., Kishino, Y., Ishiguro, K., Kamei, K., Sakurai, Y., Okadome, T.: Object-based activity recognition with heterogeneous sensors on wrist. In: International Conference on Pervasive Computing, pp. 246–264. Springer (2010)
  14. Meng, J., Li, H., Han, Z.: Sparse event detection in wireless sensor networks using compressive sensing. In: 43rd Annual Conference on Information Sciences and Systems (CISS 2009), pp. 181–185. IEEE (2009)
  15. Perkowitz, M., Philipose, M., Fishkin, K., Patterson, D.J.: Mining models of human activities from the web. In: Proceedings of the 13th International Conference on World Wide Web, pp. 573–582. ACM (2004)
  16. Reiss, A., Stricker, D.: Personalized mobile physical activity recognition. In: Proceedings of the 2013 International Symposium on Wearable Computers, pp. 25–28. ACM (2013)
  17. Riboni, D., Bettini, C.: COSAR: hybrid reasoning for context-aware activity recognition. Pers. Ubiquitous Comput. 15(3), 271–289 (2011)
  18. Riboni, D., Bettini, C., Civitarese, G., Janjua, Z.H., Helaoui, R.: Fine-grained recognition of abnormal behaviors for early detection of mild cognitive impairment. In: 2015 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 149–154. IEEE (2015)
  19. Roggen, D., Calatroni, A., Rossi, M., Holleczek, T., Förster, K., Tröster, G., Lukowicz, P., Bannach, D., Pirkl, G., Ferscha, A., et al.: Collecting complex activity datasets in highly rich networked sensor environments. In: 2010 Seventh International Conference on Networked Sensing Systems (INSS), pp. 233–240. IEEE (2010)
  20. Stikic, M., Larlus, D., Schiele, B.: Multi-graph based semi-supervised learning for activity recognition. In: 2009 International Symposium on Wearable Computers, pp. 85–92. IEEE (2009)
  21. Sun, F.-T., Yeh, Y.-T., Cheng, H.-T., Kuo, C., Griss, M.: Nonparametric discovery of human routines from sensor data. In: 2014 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 11–19. IEEE (2014)
  22. Tapia, E.M., Choudhury, T., Philipose, M.: Building reliable activity models using hierarchical shrinkage and mined ontology. In: International Conference on Pervasive Computing, pp. 17–32. Springer (2006)
  23. Van Kasteren, T., Englebienne, G., Kröse, B.J.: Transferring knowledge of activity recognition across sensor networks. In: International Conference on Pervasive Computing, pp. 283–300. Springer (2010)
  24. Van Kasteren, T., Noulas, A., Englebienne, G., Kröse, B.: Accurate activity recognition in a home setting. In: Proceedings of the 10th International Conference on Ubiquitous Computing, pp. 1–9. ACM (2008)
  25. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (2013)
  26. Wang, S., Pentney, W., Popescu, A.-M., Choudhury, T., Philipose, M.: Common sense based joint training of human activity recognizers. IJCAI 7, 2237–2242 (2007)
  27. Wen, J., Indulska, J., Zhong, M.: Adaptive activity learning with dynamically available context. In: 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 1–11. IEEE (2016)
  28. Wen, J., Loke, S., Indulska, J., Zhong, M.: Sensor-based activity recognition with dynamically added context. In: Proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 1–10. ICST (2015a)
  29. Wen, J., Zhong, M., Indulska, J.: Creating general model for activity recognition with minimum labelled data. In: Proceedings of the 2015 ACM International Symposium on Wearable Computers, pp. 87–90. ACM (2015b)
  30. Wu, J., Osuntogun, A., Choudhury, T., Philipose, M., Rehg, J.M.: A scalable approach to activity recognition based on object use. In: IEEE 11th International Conference on Computer Vision (ICCV 2007), pp. 1–8. IEEE (2007)
  31. Wyatt, D., Philipose, M., Choudhury, T.: Unsupervised activity recognition using automatically mined common sense. AAAI 5, 21–27 (2005)
  32. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inf. Theory 51(7), 2282–2312 (2005)
  33. Zhan, K., Faux, S., Ramos, F.: Multi-scale conditional random fields for first-person activity recognition. In: 2014 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 51–59. IEEE (2014)
  34. Zhou, B., Cheng, J., Sundholm, M., Reiss, A., Huang, W., Amft, O., Lukowicz, P.: Smart table surface: a novel approach to pervasive dining monitoring. In: 2015 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 155–162. IEEE (2015)

Copyright information

© China Computer Federation (CCF) 2018

Authors and Affiliations

  1. Guilin University of Technology, Guilin, China
  2. Airborne Troops Training Base, Guilin, China
  3. The University of Queensland, Brisbane, Australia
