Ziyang Yang – PhD Student at the STOR-i CDT

Overview of two changepoint algorithms – PELT and FPOP (Sat, 11 Dec 2021)

Hi! As you know, I have started the first year of my PhD, and my research direction is anomaly detection, which can be viewed as a special kind of changepoint problem. So I have started writing notes during my PhD studies. They are suitable for people who are already familiar with changepoint algorithms and want a quick review. I am happy to discuss my notes with you if there is something I have misunderstood :)

To detect the change!

The story begins with detecting a change. Assume we observe data from time \( 1 \) to time \( t \); we can denote this sequence of observations by \( y_1, y_2, …, y_t \). We aim to infer the locations and the number of these changes – that is, to infer the changepoints \( \tau_1 < \tau_2 < … < \tau_K \). This question can be recast as a classification problem: how can we split an ordered sequence into several segments as well as possible? Based on this idea, we can write the following minimisation problem:

$$ \min_{K \in N ; \tau_1,\tau_2,…,\tau_K} \sum_{k=0}^K L(y_{\tau_k+1:\tau_{k+1}}), $$

\( L(y_{\tau_k+1:\tau_{k+1}})\) represents the cost of modelling the data \( y_{\tau_k+1},y_{\tau_k+2},…,y_{\tau_{k+1}} \) as a single segment (it could be, for example, the negative log-likelihood); \( K \) is the total number of changepoints, and we use the convention \( \tau_0 = 0 \) and \( \tau_{K+1} = t \). Thus, the problem asks us to split the ordered sequence into \( K+1 \) segments so that the total cost is minimised. The trouble is that if we use this minimisation directly, we will always assign a single observation to each segment. Imagine we have 3 data points: when \( K=2 \), we always achieve the minimum because the cost of each one-point segment is 0, so the sum is 0 too. Thus, we need a constraint!

Constrained problem and penalised optimisation problem

If we know the optimal number of changepoints \( K \), we can force our algorithm to find exactly \( K+1 \) segments. This is the constrained problem:

$$Q_n^K=\min_{\tau_1<\tau_2<…<\tau_K} \sum_{k=0}^K L(y_{\tau_k+1:\tau_{k+1}}).$$

Solving the RHS of the above equation by dynamic programming gives the optimal locations of the \( K \) changepoints. However, this requires prior knowledge of the number of changepoints. When we don't know the exact number, we can set a maximum number of changepoints, e.g. 10, compute \( Q_n^k \) for all \( k=0,1,2,…,10 \), and choose the \( k \) that minimises a penalised criterion. In general, this can be written as:

$$\min_k[Q_n^k+g(k,n)],$$

where \( g(k,n) \) is the penalty function for overfitting the number of changepoints.

Given the idea above, a group of clever people thought: why not transform the problem into a penalised one? If the penalty function is linear in \( k \), say \( g(k,n)=\beta k \) with \( \beta>0 \), we can write:

$$Q_{n,\beta}=\min_{k}[Q_n^k+\beta k]=\min_{K,\tau}\Big[\sum_{k=0}^K \big(L(y_{\tau_k+1:\tau_{k+1}})+\beta\big)\Big] - \beta.$$

This is known as the penalised optimisation problem. And all our stories come from here.

To solve the penalised optimisation problem, we could use dynamic programming! Given \( Q_{0,\beta}=-\beta \) , we have

$$Q_{t,\beta}=\min_{0\leq\tau<t} [Q_{\tau,\beta}+L(y_{\tau+1:t})+\beta],$$

for \( t=1,2,…,n \). However, solving this recursion takes \( O(n^2) \) time, because at each step \( t \) we must consider every possible location \( \tau \) of the last changepoint. To reduce the computational cost, two approaches have been proposed – PELT and FPOP!
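To make the recursion concrete, here is a minimal sketch of the \( O(n^2) \) dynamic program. It assumes a Gaussian change-in-mean cost, i.e. the segment cost is its residual sum of squares; the function names are my own, not from any library.

```python
import numpy as np

def optimal_partitioning(y, beta):
    """O(n^2) dynamic program for the penalised changepoint problem.
    Segment cost L(y_{s+1:t}) is the residual sum of squares around the
    segment mean (Gaussian change-in-mean, unit variance)."""
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))            # cumulative sums
    c2 = np.concatenate(([0.0], np.cumsum(np.square(y)))) # cumulative sums of squares

    def cost(s, t):  # RSS of y[s:t] as one segment, computed in O(1)
        m = t - s
        tot = c1[t] - c1[s]
        return (c2[t] - c2[s]) - tot * tot / m

    Q = np.full(n + 1, np.inf)
    Q[0] = -beta                       # Q_{0,beta} = -beta
    last = np.zeros(n + 1, dtype=int)  # optimal last changepoint before each t
    for t in range(1, n + 1):
        for tau in range(t):
            cand = Q[tau] + cost(tau, t) + beta
            if cand < Q[t]:
                Q[t], last[t] = cand, tau
    # backtrack the changepoint locations
    cps, t = [], n
    while last[t] > 0:
        t = last[t]
        cps.append(t)
    return Q[n], sorted(cps)

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
Qn, cps = optimal_partitioning(y, beta=2 * np.log(len(y)))
```

With a mean shift of 5 standard deviations at time 50, the recovered changepoints should cluster tightly around 50.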

PELT

Let’s look at some maths first! Assume there exists a constant \( a \) such that, whenever

$$Q_\tau + L(y_{\tau+1:t}) + a > Q_t,$$

\( \tau \) can never be the optimal last changepoint at any future time. The LHS of the inequality above is the cost of a segmentation whose last changepoint before \( t \) is at \( \tau \), while the RHS is the optimal cost up to time \( t \). Thus, in the following recursions, we do not need to consider the possibility that \( \tau \) is the optimal last changepoint. So, the search space is reduced!

In each recursion, in addition to calculating the costs and finding the minimum, we add an extra step that checks whether the inequality holds and discards the candidates that fail it. In the worst case, the computational complexity is still \( O(n^2) \). However, when the number of changepoints increases linearly with the number of observations, PELT achieves linear computational complexity!
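The pruning step is one extra line on top of the same dynamic program. The sketch below again assumes the Gaussian change-in-mean cost, for which the constant \( a \) can be taken to be 0 (adding a changepoint never increases a segment's residual sum of squares); the function name is my own.

```python
import numpy as np

def pelt(y, beta):
    """PELT sketch: the penalised DP plus pruning of candidate last changepoints."""
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(np.square(y))))

    def cost(s, t):  # residual sum of squares of y[s:t]
        m = t - s
        tot = c1[t] - c1[s]
        return (c2[t] - c2[s]) - tot * tot / m

    Q = np.full(n + 1, np.inf)
    Q[0] = -beta
    last = np.zeros(n + 1, dtype=int)
    cands = [0]                        # surviving candidate last changepoints
    for t in range(1, n + 1):
        vals = [Q[tau] + cost(tau, t) + beta for tau in cands]
        i = int(np.argmin(vals))
        Q[t], last[t] = vals[i], cands[i]
        # pruning: keep tau only if Q_tau + cost(tau, t) <= Q_t (taking a = 0)
        cands = [tau for tau, v in zip(cands, vals) if v - beta <= Q[t]]
        cands.append(t)
    cps, t = [], n
    while last[t] > 0:
        t = last[t]
        cps.append(t)
    return Q[n], sorted(cps)

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
Qn, cps = pelt(y, beta=2 * np.log(len(y)))
```

On the same simulated data it returns the same segmentation as the unpruned DP, but the candidate list stays short.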

FPOP

FPOP uses functional pruning. In the optimisation problem above, we first model each segment and then minimise over the changepoint locations. Functional pruning swaps the order of minimisation: we minimise over the last changepoint as a function of the segment parameter. Assume the segment cost can be written as a sum of pointwise costs \( \gamma(y_i,\theta) \); things become very clear if we do some maths:

$$\begin{split} Q_t&=\min_{0 \leq \tau < t} [Q_\tau+L(y_{\tau+1:t})+\beta]\\&=\min_{0 \leq \tau < t} [Q_\tau+\min_\theta \sum_{i=\tau+1}^t \gamma(y_i,\theta)+\beta]\;\;\;\; (\mathrm{independence\; assumption})\\&=\min_{\theta} \min_{0 \leq \tau < t} [Q_\tau+\sum_{i=\tau+1}^t \gamma(y_i,\theta)+\beta]\\& = \min_{\theta} \min_{0 \leq \tau < t} q_t^\tau(\theta)\\&=\min_{\theta} Q_t(\theta),\end{split} $$

where \( q_t^\tau(\theta) \) is the optimal cost of partitioning the data up to time \( t \), conditional on the last changepoint being at \( \tau \) and the current segment parameter being \( \theta \); and \( Q_t(\theta) \) is the optimal cost of partitioning the data up to time \( t \) with the current segment parameter being \( \theta \). Simply put, if we have two candidate changepoints \( \tau_1 \) and \( \tau_2 \), then \( \tau_1 \) can never be part of the optimal partition if \( q^{\tau_1}_t(\theta) > q^{\tau_2}_t(\theta) \) for every \( \theta \), so we can simply prune \( \tau_1 \).

You can easily see the difference. In the penalised optimisation problem, we first pick a candidate partition, then estimate the parameter within each segment, and finally find the minimum cost; the associated partition is the solution we want. Here, instead, each possible parameter value has a corresponding best partition, and the minimum cost over all parameter values gives the solution we want.

The equation above could be easily solved by dynamic programming:

$$Q_t(\theta)=\min \big\{Q_{t-1}(\theta), \min_{\theta'} Q_{t-1}(\theta')+\beta\big\}+\gamma(y_t,\theta)$$

The first term in the braces corresponds to no new changepoint at the last step, while the second corresponds to introducing a new changepoint when the new observation arrives. In the worst case, the time complexity is \( O(n^2) \); but in the best case, the time complexity is \( O(n\log n) \).
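To see functional pruning in action, here is a deliberately simplified numerical sketch for the Gaussian cost \( \gamma(y,\theta)=(y-\theta)^2 \). Real FPOP stores each \( q_t^\tau(\theta) \) as an exact quadratic and prunes on intervals of \( \theta \); this illustration instead tabulates the curves on a grid of \( \theta \) values (an approximation I introduce here, not part of the actual algorithm).

```python
import numpy as np

def fpop_grid(y, beta, grid):
    """Functional pruning on a theta grid. Candidate tau's curve is
    q_t^tau(theta) = Q_tau + beta + sum_{i=tau+1}^t (y_i - theta)^2."""
    alive = {0: np.zeros(grid.shape)}   # tau = 0 curve: Q_0 + beta = -beta + beta = 0
    Qt = alive[0]
    for t in range(1, len(y) + 1):
        gam = (y[t - 1] - grid) ** 2
        alive = {tau: q + gam for tau, q in alive.items()}  # absorb gamma(y_t, .)
        Qt = np.minimum.reduce(list(alive.values()))        # Q_t(theta) on the grid
        # functional pruning: drop any curve that is nowhere the minimum
        alive = {tau: q for tau, q in alive.items() if np.any(q <= Qt + 1e-9)}
        if t < len(y):  # new candidate: last changepoint at t, a flat curve Q_t + beta
            alive[t] = np.full(grid.shape, Qt.min() + beta)
    return Qt.min(), sorted(alive)

y = np.array([0.0, 0.1, -0.1, 5.0, 5.2, 4.9])
Q, survivors = fpop_grid(y, beta=1.0, grid=np.linspace(-2.0, 7.0, 1801))
```

On this toy series the optimal segmentation splits after the third point, so the candidate \( \tau = 3 \) survives while dominated curves are pruned, and \( Q \) matches the scalar recursion \( Q_{n,\beta} \) up to the grid resolution.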

I will explain the figure here, and you will see how FPOP prunes candidate changepoints.

This is a direct screenshot from the paper "On Optimal Multiple Changepoint Algorithms for Large Data" (arXiv: 1409.1942).

At time \( t=78 \), we store \( 7 \) intervals with their associated values of \( \mu \). Since the \( \gamma \) function is assumed to be the negative log-likelihood of normally distributed data, each curve is quadratic in the figure. At time \( t=79 \), we observe a new data point, so we update the cost functions and obtain the result shown in the middle panel. Notice that in the middle panel the purple curve is no longer optimal anywhere (it lies above all the other curves); this curve corresponds to the cost when the last changepoint is located at 78. Thus, we can prune this candidate and never again consider \( 78 \) as a possible last changepoint (as shown in panel (c)).

It is NOT a failure for a PhD to work in industry instead of academia (Tue, 27 Apr 2021)

I am currently an MRes student at STOR-i, and I have been thinking about possible career paths (either industry or academia), since we have to choose between strategic and industry PhD projects. Last week, I attended an online careers event in statistics discussing the possible career paths for a PhD student. That reminded me of a recently popular question: is it a failure for a PhD to work in industry after graduation? Many people think it is a shame when PhD students do not continue in academia after spending several years obtaining the degree, believing they can never return to academia once they stop doing research.

This reminds me that one reason I joined the STOR-i programme is its close collaboration with industry. As we know, statistics is a very applied discipline. That is why I like it: statistics can directly impact our daily life, such as detecting changepoints, predicting earthquakes with extreme value theory, or testing the efficacy of clinical methods and drugs. And the STOR-i programme will allow me to work directly with a company to solve real-world problems after I finish the first MRes year :). Naturally, my career plan is to work in industry after graduation (though I may change my mind later).

Really???? I was also scared of choosing the industry path, as I originally thought I could never do research again and return to academia.

Working in the industry doesn’t mean no publication

Prof. Jonathan Tawn said most PhD graduates go into industry because of the direct impact of their work, and they can still publish from industry. When I wrote my research report on Thompson sampling, I found that Microsoft Research had published several famous papers. This seems reasonable, since industry also needs creativity and can find real-world open questions easily. So, working in industry does not mean no research.

Working in industry doesn’t mean no way back to academia

Besides, it is also possible to return to academia from industry, just via a different route. The usual academic route is PhD – Postdoc – Lecturer – Senior Lecturer – Reader/Professor. A possible route back from industry is PhD – Industry – Lecturer – etc. Thus, people working in industry can also return to academia.

Working on an industry project doesn’t mean no theoretical development

Originally I thought an industry project might be purely applied, with no theoretical development. When I met with potential supervisors, they said we have to develop new methods based on the questions provided by the companies. So we are not working as company employees doing routine, repetitive tasks; we have to create a new methodology, a new algorithm or new theory, just like any other academic project!! Besides, the research scope is not fixed to one specific question the company asks us to solve. Usually, the scope is flexible enough to let us advance academia and industry at the same time.


As a result, working in industry doesn’t mean the end of academia, and it lets you see the direct impact of your work. Now I am much clearer about my path, and I hope this blog helps you if you have similar questions.

Handling human stubbornness when people think they are smarter than data science! (Sun, 25 Apr 2021)

Last month, we had a problem-solving day about handling human stubbornness during the implementation of data science. You may have heard that data science can support good decisions, for example helping grocery stores decide better. And you may also be familiar with travel routes planned by Google Maps! Similarly, delivery companies plan optimised routes for their drivers. For example, drivers for delivery companies (e.g. Amazon) have to deliver hundreds of parcels to many different addresses every day, and these companies use vehicle routing models to compute the best routes for delivery drivers (e.g. the routes with the least time).

At left is a delivery route computed by the Last Mile team’s optimization software, at right the route that a delivery driver actually chose to drive. (Map details have been omitted.) Green symbols (A and B) indicate the driver’s starting locations, purple symbols (also A and B) the ending locations. – cited from Amazon

However, drivers often deviate from the optimised delivery route computed by data science, hoping to reduce the journey time. This is usually because they think they are more familiar with the area and have their own driving habits. Unfortunately, most of the time, this increases the time required, including the time to unload packages from the van at stops.


Data scientists always consider how accurate their model is, but pay less attention to monitoring the implementation of the whole process. However, making sure the whole process goes according to plan is much harder than designing an efficient algorithm. Recently, Amazon and MIT launched a new competition looking for solutions that reduce the probability of drivers deviating from their routes.

Our group discussed this problem and considered two possible ways to improve drivers’ adherence to the planned routes:

Specific personalised driver routes

Most drivers deviate from the route plan because they have their own driving habits. So, why not design an optimised route that incorporates those habits?


We could collect information:

  • Driver information: driving experience, years of driving, age, familiarity with the delivery area, customer satisfaction, etc.
  • Feedback from the driver: satisfaction with the route plan, reasons for deviations, unusual traffic reports, etc.
  • GPS data: tracking deviations, estimated optimal time versus real time.

With these data, the routing model could learn to design optimised routes that incorporate drivers’ preferences.

Reward and penalty system

This idea relies not on technology but on psychology. It would also be useful to set up a reward and penalty system:


  • Reward drivers for loyalty to the prescribed routes, and reward drivers who deviate from the route but finish in less than the optimised time.
  • Penalise deviations that cause delays.

We only came up with a few ideas during the problem-solving day. There are still other good ideas for tackling this issue, and Amazon's competition is running now! During the discussion, we were given a really helpful video (after the 32-minute mark, it talks about personalised route planning).

How could a particle filter track Thanos? – explaining the particle filter without mathematics (Sat, 24 Apr 2021)

This blog will give a general idea of the principle of the particle filter without mathematical proofs.

Recently, we were given talks about the particle filter (or sequential Monte Carlo). The particle filter is widely applied in signal processing, object tracking, time series, finance, etc. In the beginning, I was also scared by the maths behind it, like partial differential equations and the Bayesian machinery. However, the idea behind the particle filter is very straightforward and intuitive.

Now, let's set up a scenario to explain it as a tracking problem, without mathematics!

Scenario Setting

Assume we are agents located in 'chessboard' country (so called because the map of this country looks like a chessboard 🙂 ). One day, Thanos came to this country, announced that he had stolen our magic stone, and then left and hid somewhere in our city.

Our aim: we are told to trace him before the Avengers arrive.

This is Thanos. If you don't know him, you only need to know he is a bad guy. If we can't find where he hides and take our magic stone back, he will destroy the whole world!!!! And the Avengers will help us if we can successfully find his location!

How to find him?

Luckily, we have three clever dogs named 'Particle A', 'Particle B' and 'Particle C'. They have already memorised the smell of Thanos! And we are also clever enough to communicate with them.

Every 10 minutes, each dog can move ahead one grid square on the map, based on its own ideas. Then our dogs report their locations and how likely it is that Thanos passed through those areas.

Time=0 minutes

At the beginning, Particle B said it was 90% sure it smelled Thanos' scent. So we think that, at the beginning, Thanos most likely went across the middle path.

Time=10 minutes

After 10 minutes, our dogs moved to new locations and reported their positions and findings. Thanos seems to have gone up from the middle path, as Particle A was 60% sure.

Time=70 minutes

The purple line is the real trace of Thanos

At T=70, we report the most likely route to the Avengers (the blue line in the figure) based on our three dogs' findings. We traced Thanos very well at the beginning, but after T=40 we are far away from his real trace! In the end, our task failed: the Avengers could not find him, and he destroyed our world!

Why did we fail? Our dogs are clever, and so are we. Wait! We tracked his route very well initially, but after time = 40, Particle B and Particle C kept exploring the bottom area. Particle B and Particle C said they were not sure Thanos had come their way, but Particle A always said it smelled Thanos' scent. So we could only rely on Particle A, which is why we were sure Thanos went along Particle A's route.

Particle filter without resampling

The example above illustrates the principle of the particle filter without resampling. At each time step, several particles move according to the transition distribution (just as our dogs follow their own ideas about where to go). Based on the evidence (the scent of Thanos in our case), we obtain the likely location at each time step, and over time we build up a trace. However, the particle filter without resampling always fails in the long run due to weight degeneracy: the trace ends up being determined by only a few particles (in our example, we relied only on Particle A after T=40).
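The dog story maps directly onto sequential importance sampling without resampling. Below is a minimal sketch on an assumed toy model (a 1D Gaussian random walk observed with noise; all names and parameter values are my own choices): each particle moves, its log-weight absorbs the evidence, and the effective sample size collapses onto a few particles, which is exactly the weight degeneracy described above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_particles = 50, 200
sigma_move, sigma_obs = 1.0, 0.5

# simulate Thanos' hidden path and the noisy "scent" observations
x_true = np.cumsum(rng.normal(0.0, sigma_move, T))
obs = x_true + rng.normal(0.0, sigma_obs, T)

particles = np.zeros(n_particles)   # all "dogs" start at the same spot
logw = np.zeros(n_particles)        # log-weights, never reset (no resampling)
ess = []                            # effective sample size over time
for t in range(T):
    particles = particles + rng.normal(0.0, sigma_move, n_particles)  # each dog moves
    logw += -0.5 * ((obs[t] - particles) / sigma_obs) ** 2            # scent evidence
    w = np.exp(logw - logw.max())   # normalise stably in log space
    w /= w.sum()
    ess.append(1.0 / np.sum(w ** 2))
```

By the final step, the effective sample size is a tiny fraction of the 200 particles: the filter is effectively relying on one or two "dogs".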

Time goes back

Okay, so now the Avengers have turned back time and we can search again. Particle B and Particle C kept exploring locations in the bottom-right corner, where Thanos obviously had not been, so letting them search those locations is a waste.

New rule: if one of the dogs finds that Thanos most likely passed through its area, we introduce a new dog in that area for a more focused search. Additionally, the dog with the least certainty is removed.

Time = 0 minutes

Particle B is 90% sure, while Particle C is only 2% sure. So we move Particle C away and add one more dog to the area where Particle B is.

Time = 10 minutes

Next, our dogs move following their own ideas. Since Particle A is 60% sure it smelled Thanos, we introduce a new dog in its area. Since Particle C in the bottom grid is only 5% sure it smelled Thanos, we discard it and let it go back home.

Time = 70 minutes

As we can see, at time 70 we successfully trace Thanos, and we save the world!

Particle filter with resampling

Due to the limitation of the particle filter without resampling, we introduce a resampling scheme into the process. It is easy to understand: we duplicate particles that have a high probability of being on the right path (like introducing a new dog in that area), and we discard particles with a low probability of finding the true path.
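Adding the new rule from the story amounts to one extra step: resample the particles according to their weights after each observation. This sketch reuses the same assumed toy model as before (my own illustrative parameters, not a canonical example).

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_particles = 50, 200
sigma_move, sigma_obs = 1.0, 0.5

# same toy model: hidden random walk, noisy "scent" observations
x_true = np.cumsum(rng.normal(0.0, sigma_move, T))
obs = x_true + rng.normal(0.0, sigma_obs, T)

particles = np.zeros(n_particles)
estimates = []
for t in range(T):
    particles = particles + rng.normal(0.0, sigma_move, n_particles)   # move
    w = np.exp(-0.5 * ((obs[t] - particles) / sigma_obs) ** 2)         # weight by scent
    w /= w.sum()
    estimates.append(np.dot(w, particles))                              # weighted mean
    # resampling: duplicate strong particles, discard weak ones
    particles = particles[rng.choice(n_particles, size=n_particles, p=w)]

rmse = float(np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))
```

With resampling, the particle cloud keeps following the true path, so the tracking error stays on the order of the observation noise instead of drifting away as in the no-resampling version.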

This is the intuitive idea behind the particle filter! Now you can understand the whole process!

For more reading:

This is a really good review paper!

This is a really good tutorial! And the references are very famous papers!

This is a really good cartoon to show the whole process:

Statistics in Social Science (3): Step-by-step tutorial on the one-way ANOVA test (Wed, 14 Apr 2021)

This blog explains the one-way ANOVA test in detail (including its assumptions, when to use it, and how to interpret it), with an example analysed in R at the end.

What is this test for?

You may be familiar with the t-test and other nonparametric tests used to check whether two groups differ in their means (e.g., whether two classes differ in mean score, or whether one treatment is better than another). The one-way analysis of variance (ANOVA) is used to determine whether there is a significant difference among the means of three or more independent groups. For example:

  • if there is a difference in mean score among the four classes
  • if there is a difference in the mean effect among the three types of treatment

Assumptions:

There is no free lunch. To use the one-way ANOVA test, three assumptions should be satisfied:

  • The variable is normally distributed within each group (technically, it is the residuals that need to be normally distributed, but the results will be the same). For example, if we want to compare the mean score of three classes, the scores should follow a normal distribution within each class.
  • The variances are homogeneous. This means the population variance in each group should be equal. For example, the students' scores in the three classes should fluctuate by a similar amount.
  • The observations should be independent. This means one observation will not influence other observations. For example, student A’s grade will not influence student B’s grade as they took their exam independently.

All three assumptions should be checked when using the one-way ANOVA test. Now, let's look at how to implement the ANOVA test in R.

How to do it and explain it (An example in R)

Let’s use the built-in R dataset ‘PlantGrowth’. It contains the weights of 30 plants in three groups (10 plants that receive no treatment (the control group), 10 plants that receive treatment 1, and 10 plants that receive treatment 2). Our purpose is to find out whether there is a difference in mean weight among the three groups.

First, let’s draw a boxplot to look at the data graphically.

From the boxplot, we can see that treatment 1 has a slightly lower mean weight than the control group, though the difference is not large, and that plants receiving treatment 2 have a larger weight than the other two groups.

Next, we measure the difference through One-way ANOVA, and we got the result:

data <- PlantGrowth  # built-in dataset
res.aov <- aov(weight ~ group, data = data)
# Summary of the analysis
summary(res.aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation

At the 5% significance level, the p-value of the test is less than 0.05 (p = 0.0159 < 0.05), so we conclude that there is a significant difference among the groups.
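To see what the F value in the R output measures, here is a from-scratch computation of the one-way ANOVA decomposition in Python (the numbers are made up for illustration; they are not the PlantGrowth data):

```python
# hypothetical weights for three groups (illustrative data only)
groups = {
    "ctrl": [4.2, 5.0, 4.6, 5.1, 4.8],
    "trt1": [4.0, 4.4, 4.1, 4.5, 4.2],
    "trt2": [5.5, 5.9, 5.6, 6.0, 5.7],
}

all_vals = [v for g in groups.values() for v in g]
grand_mean = sum(all_vals) / len(all_vals)

# between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(
    len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups.values()
)
# within-group sum of squares: spread of observations around their own group mean
ss_within = sum(
    (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
)

df_between = len(groups) - 1             # k - 1
df_within = len(all_vals) - len(groups)  # N - k
F = (ss_between / df_between) / (ss_within / df_within)
```

The identity SS_total = SS_between + SS_within always holds, and a large F means the between-group variation dominates the within-group noise, which is exactly what the p-value in the R table assesses.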

However, this only tells us that the groups differ somewhere; we don't know which pairs of groups are different. To find out whether specific pairs differ, we can use Tukey's multiple pairwise comparisons:

TukeyHSD(res.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = data)
$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

At the 5% significance level, we can conclude that treatment 2 gives a significantly larger mean plant weight than treatment 1 (p = 0.012). However, there is no statistical evidence that treatment 2 is better than the control, or that treatment 1 is worse than receiving no treatment.

Checking the assumptions

Now let's check the assumptions:

  • Normality assumption. On the QQ plot, most points lie on the straight line except points 4, 15 and 17. However, we only have a small sample (30 plants), so a normal QQ plot like this is reasonable. We can also test normality with the Shapiro–Wilk test. At the 5% significance level, we cannot reject the null hypothesis that the residuals are normally distributed.
shapiro.test(x = residuals(res.aov) )

	Shapiro-Wilk normality test

data:  residuals(res.aov)
W = 0.96607, p-value = 0.4379
  • Homogeneous variance assumption: from the Residuals vs Fitted plot, we can see slight evidence of non-constant variance, since the spread differs between groups, but it does not look serious. Levene's test can also be used to check the homogeneity of variance. At the 5% level, we cannot reject the null hypothesis (p-value > 0.05), so we may assume homogeneity of variances across the treatment groups.
library(car)  # leveneTest() lives in the 'car' package
leveneTest(weight ~ group, data = data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  1.1192 0.3412
      27      
  • Independence assumption: this one needs more thought. In our example, we can assume independence holds, since the weight of one plant does not influence the weight of the other plants.

That’s all done! This blog draws on a tutorial that includes the full R code:

Besides, I also found useful blogs that use SPSS to run a one-way ANOVA test:

How do grocery stores know about pregnancy? Why is the beer aisle always next to the diapers? – Data science in the retail industry (Fri, 02 Apr 2021)

Recently, I became interested in applications of data science. It is well known that data science can be used in personalised treatment, advertising, beating credit card fraud, etc. To my surprise, data science has already been used in the retail industry for many years. In the competitive retail industry, what can data science do? Let's look at two real examples from supermarkets.

How a grocery store knew about a pregnancy

In 2003, an angry man went to his local Target supermarket to complain to the manager that his teenage daughter had received a personally addressed flyer from the store. The flyers advertised maternity products, babywear, baby furniture, nappies and infant formula.

“are you trying to encourage my daughter to get pregnant?”

Some weeks later, the store manager rang back to apologise once again. It turned out his teenage daughter was, in fact, pregnant, and she hadn't told her parents yet.

So, how did Target know the daughter was pregnant before her father did? The answer is finding correlations through association analysis. Briefly speaking, correlation is a statistical term that measures the degree of relevance between quantities. Target's researchers found that pregnant women are more likely to purchase certain types of products, which means there is a high correlation between pregnancy and purchasing those products. This blog illustrates the idea with a simple example.

Statisticians at Target used customers' shopping habits and physical condition to find such relationships. Shopping habits are usually stable, but customers change them when certain events happen. For example, customers purchased more toilet rolls and hand sanitiser than usual during Covid-19. Pregnant women usually switch from scented to unscented soap and start buying supplements such as calcium, magnesium and zinc. With such customer-habit prediction models, Target's annual sales rose from 44 billion to 67 billion dollars between 2002 and 2010, which indicates the success of prediction models based on association analysis.

Beers and diapers

This reminds me of another real example in the retail industry – 'beer and diapers'. Walmart found that men are more likely to purchase beer and diapers together in a single shopping trip. Similarly, they used association analysis to study co-occurrences in shopping histories and found that beer and diapers are commonly purchased together. Thus, you may find the beer aisle is always next to the diapers.
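The quantities behind such an analysis are simple to compute. Here is a toy sketch in Python with made-up transactions (not Walmart's data): the support of an itemset, and the confidence and lift of a rule such as {diapers} → {beer}.

```python
# made-up shopping baskets, for illustration only
transactions = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"milk", "chips"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Estimate of P(rhs in basket | lhs in basket)."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to what independence of lhs and rhs would predict."""
    return confidence(lhs, rhs) / support(rhs)

conf = confidence({"diapers"}, {"beer"})  # every diaper basket also has beer
lft = lift({"diapers"}, {"beer"})         # > 1 indicates positive association
```

A lift above 1 is what justifies placing the two aisles together; a lift near 1 would mean the co-purchases are just what chance predicts.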

Furthermore…..

As most of us already know, the placement of aisles is designed using statistical research. But that is old news; nowadays, supermarkets use data science for much more. For example, they use it to design good promotions, and to predict the sales of certain products (like umbrellas, ice cream and so on) from the weather so they can prepare enough stock. They also use data science to manage inventory based on next-day predictions. Watching the BBC documentary on supermarkets, you will be amazed at how much data science they use to provide a more convenient shopping environment to attract you!!

However, is data science always right?

Such models are based on simple rules; as long as users fit the rules (e.g., purchasing unscented soap and supplements), the model may flag them, for example as pregnant. Although such a model is blatant and crude, it indeed helps retailers make better decisions and earn higher profits.

However, is Target's model always right? A woman may start buying supplements and unscented soap because she has become allergic and wants to improve her physical condition, yet the model still thinks she has a high probability of being pregnant. In other words, pregnant women are more likely to purchase supplements and unscented soap, but purchasing supplements and unscented soap does not imply pregnancy. That's the difference between correlation and causality, and such causality is related to the predictive ability of the model.


We can easily compute correlations, as it is just mathematics. However, without context the result only indicates correlation, not causality. For example, in summer, ice cream sales increase, and at the same time sharks become more active, so we can easily find a high correlation between ice cream sales and shark attacks. But does this relationship make sense? There is no causal relationship in either direction – neither of these things causes the other, even indirectly.

So a useful analysis needs both a strong correlation and a causal relationship, and causality to some extent underpins the predictive ability of the model.

In conclusion, we have introduced two famous examples of data science in the retail industry, recommended a good BBC documentary, and finally discussed why data science is not always right when correlation is mistaken for causality. When searching for data science in the supermarket, I found these blogs very useful:

This blog talks more generally about how data science and AI are changing supermarkets.

It talks about the difference between correlation and causality.

It discusses association analysis in a mathematical way.

It explains association analysis in detail and in a technical way, with R code.

]]>
Statistics in Social science (2): Explaining Linear regression /stor-i-student-sites/ziyang-yang/2021/03/14/statistics-in-social-science-2-explaining-linear-regression/?utm_source=rss&utm_medium=rss&utm_campaign=statistics-in-social-science-2-explaining-linear-regression Sun, 14 Mar 2021 23:10:34 +0000 /stor-i-student-sites/ziyang-yang/?p=235 This blog will give you a real example of how to explain linear regression.


Why do we need linear regression?

People in social science use linear regression frequently. Scientists often use it to measure the relationship between a dependent variable and independent variables, and many real-world situations can be modelled and therefore explained this way. Basically, if you have data and you want to:

  • Explain some situation: e.g. the relationship between wage and education, gender, working experience and so on. Does higher education lead to a higher average wage? Is there a gender difference in the average wage?
  • Predict something: e.g. a pupil's grade next semester, given his/her current performance.

then you could use linear regression. Linear regression is often expressed as the equation below. The dependent variable is the variable we want to explain, and the independent variables are factors associated with it. The coefficients and the constant are estimated from the data, and we will explain them later. When there is only one independent variable, the model is called simple linear regression; multiple independent variables give multiple linear regression.
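For reference, the general equation referred to here is usually written (in standard textbook notation, not tied to any particular dataset) as

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, $$

where \( y \) is the dependent variable, \( x_1, \ldots, x_p \) are the independent variables, \( \beta_0 \) is the constant (intercept), \( \beta_1, \ldots, \beta_p \) are the coefficients, and \( \varepsilon \) is the error term. Simple linear regression is the special case \( p = 1 \).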


When I was trying to find resources about linear regression, most tutorials focused only on building the model in software, while fewer mentioned interpretation, and hardly any discussed variable choice before modelling. Here we assume our readers are confident in building models with different software, so we focus only on the process of model construction and its interpretation.

What factors have to be included in the model?

Directly building a model with all variables is not sensible, especially when the number of variables is large, and you do not want to explain something by irrelevant factors, like explaining a pupil's grade by the number of animals in the zoo. So we only want relevant factors in our model.

Typically, there are two rules to find relevant factors:

  • Plot the relationship between the dependent variable and each independent variable, or calculate the correlation coefficient. If there is a linear relationship, or the correlation coefficient is reasonably large, we could consider including this independent variable in our model. To find out how these are measured, please read this blog.
  • Literature review. Find out which variables have been used in other similar research; we could then add these variables to our model.

If a variable fits either rule, it could be selected to build the model.
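As a sketch of the first rule, the correlation-based screen might look like this in R. The data are simulated and the 0.3 cut-off is a hypothetical choice made only for illustration, not a universal rule:

```r
set.seed(42)
n <- 200
age          <- rnorm(n, mean = 40, sd = 10)
zoo_animals  <- rnorm(n, mean = 50, sd = 5)   # an irrelevant candidate variable
satisfaction <- 7 - 0.1 * age + rnorm(n)      # the outcome depends on age only

candidates <- data.frame(age, zoo_animals)
r <- sapply(candidates, function(v) cor(satisfaction, v))
names(r)[abs(r) > 0.3]   # variables passing the screen (here only age should survive)
```

A variable that fails this screen can still enter the model via the second rule, the literature review.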

For example, we will measure what factors influence people's position on the life-satisfaction ladder (scored from 1 to 7, where 1 represents total dissatisfaction and 7 total satisfaction). From the literature review, research shows that age and gender have an impact on life satisfaction. The relationship plots and correlation coefficients also support this argument. Therefore, we add both age and gender to our model to explain life satisfaction.

How to explain the final model?

Fitting and checking a model is a technical and complex process, which we do not show in full here.

Recalling our life-satisfaction example, we obtain a fitted expression like:

\( \text{Life satisfaction} = 7.101 - 0.099 \times \text{Age} + 0.111 \times \text{Gender(Female)} \)

How to explain it?

  • 7.101: Intercept. This is the average life satisfaction for the reference person. In our example, it is the average life satisfaction of a 0-year-old male (which is of course impossible), so we can see that the intercept sometimes has no practical meaning.
  • -0.099: Slope for a continuous variable. In our example, it means that for people of the same gender, average life satisfaction decreases by 0.099 for each one-year increase in age. This is plausible: as people grow up they have more things to consider and are not as satisfied with their life as before.
  • 0.111: Slope for a discrete variable. In our example, it means that at the same age, females are on average 0.111 points more satisfied with their life than males.
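The numbers above come from the author's data, which is not available here. As a hedged illustration, the same structure can be reproduced on simulated data, where we know the true coefficients and can check that lm() recovers them:

```r
set.seed(1)
n <- 2000
age    <- sample(18:80, n, replace = TRUE)
female <- rbinom(n, 1, 0.5)   # 1 = female, 0 = male (the reference group)

# simulate with a known truth: intercept 7.1, age slope -0.1, female effect +0.1
satisfaction <- 7.1 - 0.1 * age + 0.1 * female + rnorm(n, sd = 0.5)

fit <- lm(satisfaction ~ age + female)
coef(fit)   # the estimates land close to 7.1, -0.1 and +0.1
```

The fitted coefficients are then read exactly as in the bullet list: the intercept is the prediction for a 0-year-old male, the age coefficient is the per-year change holding gender fixed, and the female coefficient is the gap between genders at the same age.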

Now you are able to select the variables for a model and interpret a basic linear model.

For more technical blogs on model construction:

How to build models in R:

How to build models in SPSS:

For more reading about linear regression:

]]>
3 steps to build own R package – Rcpp /stor-i-student-sites/ziyang-yang/2021/03/13/3-steps-to-build-own-r-package-rcpp/?utm_source=rss&utm_medium=rss&utm_campaign=3-steps-to-build-own-r-package-rcpp /stor-i-student-sites/ziyang-yang/2021/03/13/3-steps-to-build-own-r-package-rcpp/#comments Sat, 13 Mar 2021 18:10:19 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/ziyang-yang/?p=203 This blog gives an idea of how to build an R package through Rcpp and C++. Here we assume our readers are confident with C++, Linux and R.

This semester we have been trained to use C++ and Rcpp to write R packages. It is well known that R computes more slowly than C++. Rcpp is an R package that combines C++ and R: with Rcpp, algorithms and functions can easily be transferred between R and C++, providing high-performance statistical computing to most R users. It is useful when statisticians want to develop their own R packages. So I will describe the process in 3 steps and then walk through an example.


Step 1: Write your own algorithm in C++

Firstly, write your own algorithm in C++ on a Linux system. Next, add some code so that it can be called from R:

  • Add #include <Rcpp.h> at the beginning
  • Add // [[Rcpp::export]] above each function you want to call from R
  • Add a user interrupt through Rcpp::checkUserInterrupt(). It allows users to terminate the algorithm when it runs too long.
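Putting the three additions together, a minimal '.cpp' file might look like this. Note that running_sum is just a toy function invented for illustration; it is not part of any package in this post:

```cpp
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericVector running_sum(Rcpp::NumericVector x) {
  Rcpp::NumericVector out(x.size());
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) {
    // allow the R user to abort if the loop runs too long
    if (i % 1000 == 0) Rcpp::checkUserInterrupt();
    total += x[i];
    out[i] = total;
  }
  return out;
}
```

Before building a full package, you can test such a file directly from R with Rcpp::sourceCpp("running_sum.cpp"), which compiles it and makes running_sum() callable from the R console.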

Step 2: Build package

After obtaining our '.cpp' algorithm, we have to package it as a tarball ('.tar.gz' file) so it can easily be downloaded by anyone who wants to use it in R. To do so, we have two simple steps:

1. Building Skeleton Folder

The skeleton folder is, just like a skeleton, the frame that contains all the main programs. It is very simple to create: in R, run the code:

Rcpp.package.skeleton("The name of package", cpp_files = "path to your c++ file", example_code = FALSE)

In my example, I created a package called 'finaljarvismarch' and gave the path to my cpp file in 'cpp_files'. If example_code = TRUE, the package will also contain example code.

This creates the package skeleton in the working directory. It contains three files and three folders:

  • man: contains the .Rd files, the documentation shown in R.
  • R: contains all ".R" files written in R.
  • src: contains all ".cpp" files written in C++.

We can manually put further R or C++ functions into the corresponding folders.

2. Building Package

Once we have the skeleton folder, run this command in the terminal to create the package:

R CMD build PackageDirectory/PackageSkeletonName 

This builds the package tarball, which can then be sent to and installed on any machine running R.


Step 3: Checking installation

Until now, we have successfully created a package. However, we still have to test whether it can be installed properly.

Run this command in the terminal, in the directory containing the tarball:

 R CMD INSTALL PackageTarBallName

Luckily, no errors! Now our package can be downloaded as a tarball by any user and successfully installed into R. To load it, run library('PackageName') (note: the package name, not the tarball file name).


Example: Jarvis March algorithm

We were asked to build such a package; you can download it here. After downloading it, it can be installed easily:

  • Run the command in the terminal: R CMD INSTALL finaljarvismarch
  • Run the code in R: library(finaljarvismarch)

Now you can use the Jarvis march algorithm on 2-dimensional data. The package contains two functions:

  • findpoint_jarvis(x, y): given the coordinate vectors x and y, it outputs the points on the convex hull.
  • plot_jarvis(x, y): given x and y, it returns a plot containing all the data points and the convex hull.

For example, simulate 100 points and run the functions:

The returned x and y are the coordinates of the points on the convex hull, and we can also draw the plot:
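The package tarball and the output screenshots are not embedded here, so as a stand-in, here is a minimal pure-R sketch of the gift-wrapping idea that findpoint_jarvis implements (assuming points in general position, i.e. no three collinear):

```r
jarvis_march <- function(x, y) {
  n <- length(x)
  start <- which.min(x)          # the left-most point is always on the hull
  hull <- integer(0)
  p <- start
  repeat {
    hull <- c(hull, p)
    q <- if (p == 1) 2 else 1    # any candidate index different from p
    for (r in seq_len(n)) {
      if (r == p) next
      # positive cross product: r is counter-clockwise of q, as seen from p
      cross <- (x[q] - x[p]) * (y[r] - y[p]) - (y[q] - y[p]) * (x[r] - x[p])
      if (cross > 0) q <- r
    }
    p <- q
    if (p == start) break        # wrapped all the way around
  }
  hull                           # indices of the hull points, in order
}

# four corners of a square plus one interior point: the interior point is excluded
jarvis_march(c(0, 1, 1, 0, 0.5), c(0, 0, 1, 1, 0.5))
```

The package versions do the same wrapping in C++, which is exactly where the Rcpp speed-up pays off.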

]]>
/stor-i-student-sites/ziyang-yang/2021/03/13/3-steps-to-build-own-r-package-rcpp/feed/ 1
Statistics in Social science (1): How to choose an appropriate correlation test? /stor-i-student-sites/ziyang-yang/2021/02/26/statistics-in-social-science-1-how-to-choose-an-appropriate-statistical-test/?utm_source=rss&utm_medium=rss&utm_campaign=statistics-in-social-science-1-how-to-choose-an-appropriate-statistical-test Fri, 26 Feb 2021 11:47:00 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/ziyang-yang/?p=196 This blog will give you an idea of how to choose an appropriate statistical correlation test in the social science area.

Recently I have been talking with friends who study social science, and they are often confused about how to use statistical tests appropriately. So I decided to write a series of blogs about common statistical methods in social science and how to interpret their results.

In social science, it is common to calculate the association between two variables. For example, you may want to test the relationship between smoking and lung cancer, or between consumption and income. The choice of test can be summarised by the type of variables and their distributions, as below. In this blog, we only consider two continuous variables.

  • Two continuous variables: use the Pearson correlation if they are (roughly) normally distributed, and the Spearman correlation otherwise.
  • Two categorical variables, or one continuous and one categorical variable: other tests apply, which we do not cover in this blog.

Analysis of Correlation

Drawing the plot – a direct way

The first step of measuring the correlation is drawing the plot:

Assume we have continuous data y1, x1, x2, x3 and x4. From the plot above, we can see that y1 and x1 have a positive linear correlation; y1 and x2 seem to have no apparent linear or non-linear correlation; y1 and x3 have a negative linear correlation; and y1 and x4 have a non-linear correlation.

Calculating the correlation coefficient – a mathematical way

The correlation coefficient is a statistic measuring the strength of linear correlation. There are usually two choices: the Pearson correlation coefficient and the Spearman correlation coefficient. Even if you want to report the p-value of a correlation test, it is necessary to report the coefficient at the same time.


Pearson correlation coefficient

The Pearson correlation coefficient can be calculated in R with the cor() function. It is the most commonly used statistic; however, it assumes a normal or bell-shaped distribution for the continuous variables. We do not check that assumption here, but it has to be done in real data analysis.

The correlation coefficient ranges from -1 to 1. The sign gives the direction of the correlation: a positive value indicates a positive relationship and a negative value a negative relationship. The absolute value measures the strength of the correlation; usually |value| > 0.7 can be considered a strong correlation.

From the example we can see that y1 and x1 have a strong positive correlation; the correlation coefficient between y1 and x2 is very small, only 0.016; y1 and x3 have a strong negative correlation; and y1 and x4 have a mild correlation. Note: here we can only speak of linear correlation, since Pearson ignores non-linear relationships.

> cor(y1,x1)
[1] 0.8708785
> cor(y1,x2)
[1] 0.01631352
> cor(y1,x3)
[1] -0.9145617
> cor(y1,x4)
[1] 0.405236

Spearman Rank correlation coefficient

Unlike Pearson's method, Spearman's method makes no assumption about the distribution of the variables. Usually we get a result similar to Pearson's (as we see below). The difference between the Spearman rank correlation and the Pearson correlation is that Pearson only captures linear relationships, while Spearman captures any monotonic relationship, whether linear or not.

> cor(y1,x1,method = 'spearman')
[1] 0.8520012
> cor(y1,x2,method = 'spearman')
[1] 0.01749775
> cor(y1,x3,method = 'spearman')
[1] -0.9017702
> cor(y1,x4,method = 'spearman')
[1] 0.4252865

How do we report?

  1. Firstly, draw the plot to see the relationship.
  2. If you want a statistical test:
  • draw a histogram to see whether the variables have a normal/bell-shaped distribution;
  • if yes, use a test with the Pearson method;
  • if no, use a test with the Spearman method.

It is worth noting that a statistical test is only an aid to the relationship plot, since in some cases the result of the test is not reliable. For example, we can see a strong non-linear relationship in the plot below, yet the Pearson coefficient and the Spearman coefficient are both approximately 0.
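We can reproduce this effect in R with a simple parabola, a stand-in for the plot referred to above: the relationship is perfectly deterministic, yet both coefficients are essentially 0.

```r
x <- seq(-1, 1, length.out = 101)
y <- x^2                          # perfect non-linear (quadratic) relationship

cor(x, y)                         # Pearson: essentially 0, by symmetry
cor(x, y, method = "spearman")    # Spearman: also essentially 0 (not monotonic)
```

Spearman fails here too because the relationship is not monotonic: y falls and then rises as x increases.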


Besides, we can only say there is a correlation between variables; we cannot conclude that one variable causes the other. Specifically, if two variables have a large correlation of 0.9, then when variable A has a high value, variable B will probably have a high value too. However, we cannot say that the high value of variable A causes the high value of variable B.

For further reading:

This blog specifically lists how to conduct other correlation tests between two variables in R.

This is a really good blog about the definitions, and it also contains a good video!

]]>
Why does Amazon always guess our preference? – explaining contextual bandit problem without mathematics /stor-i-student-sites/ziyang-yang/2021/02/08/contextual-bandit-problem-starting-from-an-example/?utm_source=rss&utm_medium=rss&utm_campaign=contextual-bandit-problem-starting-from-an-example /stor-i-student-sites/ziyang-yang/2021/02/08/contextual-bandit-problem-starting-from-an-example/#comments Mon, 08 Feb 2021 13:36:37 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/ziyang-yang/?p=178 This blog will give you an idea of the rationale behind recommendation systems. How does the contextual bandit problem work in such a system? I hope this blog will give you an answer.

During the last semester, we were given a list of topics to discuss as a team. The fourth topic was the bandit problem!

This is a two-armed bandit machine. Each time, you have to choose one arm to pull to earn money. How will you do that? Which arm will you choose to pull? Probably you will try several times and build up some experience. Then you may have some rules to guide which arm you pull.

This is the bandit problem, which is fundamentally about how to make good decisions. For a two-armed bandit machine, the decision is which arm to pull to earn more money. For a recommendation system, it is which news/products/videos to show to earn a higher click-through rate!!!

Amazon’s secret – recommendation system


When you open Amazon, you may notice that it automatically recommends products for you. And when you use TikTok, it probably recommends the videos that most attract you. That is a recommendation system.

Judging by Amazon’s success, the recommendation system works. The company reported a 29% sales increase to $12.83 billion during its second fiscal quarter, up from $9.9 billion during the same time last year. A lot of that growth arguably has to do with the way Amazon has integrated recommendations into nearly every part of the purchasing process.

Amazon benefits from its recommendation system by recommending personalised products to different customers. You may have noticed that once you open Amazon, it shows recommendations you are actually interested in. Similarly, Yahoo! recommends news you are interested in, and TikTok always knows your taste in videos. Although they may use different algorithms, such personalised recommendation can be achieved with contextual bandit algorithms. A good recommendation system will always know you better than you know yourself!! Now, let's look at what the contextual bandit problem is through an example.

Looking at the contextual bandit problem through an example



Assume we have a website called 'click me' that posts interesting news, and we make a profit from the click-through rate on web advertising. A list of companies has asked us to put their advertisements on our website. In order to maximise our profit, we want to personalise these advertisements and attract our visitors to click. In other words, we want to show specific advertisements to specific viewers. But how? This is the bandit problem.

Collecting the contextual information

If we want to guess a person's preferences, we first want to know more about that person. Similarly, our company wants to know more about its viewers; in a bandit problem this information is called the context. The context may contain:

  • Personal information: gender, region, age, etc.
  • Recent browsing and click-through records: even including how many seconds you spend viewing one advertisement
  • Preferences over news categories: for example, a viewer may like news about Justin Bieber, or may focus on sales information.
  • etc.

Trying and learning how to guess

Okey dokey. Now we have lots of information about our viewers. What is the next step? If you want to guess a person's favourite movie, you might show them some movies and observe their reactions. For example, if we show them 'Titanic' and they say they really love it, they probably like romantic movies, and we will show them more romantic movies to refine the guess. If we show them 'The Lion King' and they say they do not like it, we will not show them more animated films. (Just an example, I love The Lion King!!!!)


Similarly, we have a list of advertisements from a list of companies. Which advertisement do we choose to show to a viewer of a certain type?

Each time, our system shows a viewer one type of advertisement (that is, it chooses an action) and watches their reaction. If it guesses correctly, the machine gains a 'reward' (you click the product), and that reward is turned into experience about this type of viewer. If it guesses incorrectly, the machine incurs 'regret' at failing to guess the viewer's preference, and it tries again and again. After a long time, our machine can guess viewers' preferences correctly!

For example, take viewers under 6 years old. When the machine shows an ad about toys and the children click that ad, the machine gains the experience that children are more likely to click ads about toys, so next time it is more likely to put a toy advertisement on our website.

After a while, with a huge amount of data, this engine has accumulated enough information about viewers' preferences to guess them with high probability, just like the process of learning (gain experience from success, and try again after failure!)
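The learning loop described above can be sketched as a tiny epsilon-greedy contextual bandit in R. Everything here is invented for illustration (two made-up viewer types, two made-up ads, made-up click probabilities); real systems use far richer contexts and smarter algorithms:

```r
set.seed(1)
contexts <- c("child", "adult")
ads      <- c("toys", "electronics")

# true click probabilities, unknown to the learner
true_p <- matrix(c(0.6, 0.1,
                   0.2, 0.5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(contexts, ads))

shows  <- matrix(0, 2, 2, dimnames = list(contexts, ads))
clicks <- matrix(0, 2, 2, dimnames = list(contexts, ads))
epsilon <- 0.1                                   # fraction of the time we explore

for (t in 1:5000) {
  ctx <- sample(contexts, 1)                     # a viewer arrives with a context
  est <- clicks[ctx, ] / pmax(shows[ctx, ], 1)   # estimated click rate per ad
  if (runif(1) < epsilon) {
    ad <- sample(ads, 1)                         # explore: show a random ad
  } else {
    ad <- ads[which.max(est)]                    # exploit: show the best ad so far
  }
  clicked <- rbinom(1, 1, true_p[ctx, ad])       # reward: did the viewer click?
  shows[ctx, ad]  <- shows[ctx, ad] + 1
  clicks[ctx, ad] <- clicks[ctx, ad] + clicked
}

clicks / pmax(shows, 1)   # learned click-rate estimates per context and ad
```

After enough rounds, the engine should end up showing toys to children and electronics to adults almost all the time, which is exactly the "learn from success, try again after failure" loop described above.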

Now our company runs very well and can show the right advertisements to the right viewers! With a high click-through rate, we make lots of profit!!


Extended reading

This blog gives only a general idea of the multi-armed bandit problem; for more explanation, including the maths, please visit:

For more in-depth references on contextual bandits and reinforcement learning, please visit:

And this video is really good to watch if you want to start learning the topic:

]]>
/stor-i-student-sites/ziyang-yang/2021/02/08/contextual-bandit-problem-starting-from-an-example/feed/ 2