Katie Howgate – STOR-i PhD Student

Hypergraphs – not just a cool name!
Thu, 29 Apr 2021

We are currently choosing our PhD project preferences, so this week the MRes students were kindly invited to join the Networks reading group for an introductory session on hypergraphs. Personally, I hadn’t come across hypergraphs before and I can really see their benefits. One of the PhD students, Amiee Rice, gave a great overview based on a paper (see references below). I had never considered how much information is clearly missing from a normal graph in some situations, and hypergraphs are a great way to represent this information more accurately.

If you have never come across graph theory before, check out my previous blog post Graph Theory 101 for the basics.

What is a hypergraph?

We have a set of vertices \(V\) and a set of hyper-edges \(E\), and a hypergraph is represented as a combination of these, \(H = (V,\,E)\), much like a graph. However, whereas in a graph an edge can only connect 2 vertices, in a hypergraph an edge can connect any number of vertices. For this reason hyper-edges are often drawn as a loop around all the vertices they contain, rather than as a line between a pair of vertices.

The use of hypergraphs means we can display information about relationships that involve more than two objects. For instance, the example Amiee gave was co-authorship, where we have a graph that represents authors of papers and when they collaborated together on a paper. Imagine we have 3 authors A, B and C and the following graph, with the authors as the vertices and the edges representing collaboration on a paper.

While from this we know each author collaborated with both of the other authors at some point, we cannot determine whether this was separately on 3 different papers or perhaps all together on a single paper. We can, however, represent this in a hypergraph, where each hyper-edge represents a distinct collaboration on a single paper.

Just like with graphs, we can represent a hypergraph using an incidence matrix with the edges represented by the columns and the vertices as the rows.

We can also represent a hypergraph using a bipartite graph by making the hyper-edges into a separate set of vertices. This is sometimes referred to as an incidence graph.
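As a quick sketch of this representation (a hypothetical two-paper co-authorship example, not one from the session), the incidence matrix can be built directly from a list of hyper-edges, with rows as vertices and columns as hyper-edges:

```python
# Hypothetical co-authorship hypergraph: one hyper-edge per paper.
vertices = ["A", "B", "C"]
hyper_edges = [("A", "B", "C"), ("A", "B")]  # paper 1 by all three, paper 2 by A and B

# Incidence matrix: entry is 1 if the vertex belongs to the hyper-edge.
incidence = [[1 if v in e else 0 for e in hyper_edges] for v in vertices]
print(incidence)  # [[1, 1], [1, 1], [1, 0]]
```

The same list-of-edges structure also gives the bipartite (incidence-graph) view for free: one vertex set is `vertices`, the other is the hyper-edges themselves.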

A lot of graph theory terminology carries over to hypergraphs too. Examples of the basics are:

  • A path is an alternating sequence of distinct vertices and hyper-edges. Start with an initial vertex \(v_1\), then move to another vertex \(v_2\) within a hyper-edge \(e_1\) containing them both, then move to another vertex \(v_3\) within a different hyper-edge \(e_2\) that also contains \(v_2\), and so on.
  • A cycle is a path that starts and ends on the same vertex.
  • The degree of a vertex is the number of hyper-edges it is contained in.
  • We call a hypergraph r-uniform if every hyper-edge contains exactly \(r\) vertices (and r-regular if every vertex has degree \(r\)).
  • A hypergraph is simple if no edges contain other edges as a subset. So if you had two edges on a hypergraph \(e_1=(v_1,v_2)\) and \(e_2=(v_1,v_2, v_3)\) then this would not be simple because \(e_1\) is a subset of \(e_2\).
  • A hypergraph is connected if, for any two vertices, there exists a path between them.
  • A k-complete hypergraph is a k-uniform hypergraph where, for every set of \(k\) vertices, there is a hyper-edge containing exactly those \(k\) vertices. We may refer to it as just complete if it is k-complete for some \(k\).
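A couple of these definitions can be sketched in code. The hypergraph below is a made-up example reusing the edges \(e_1=(v_1,v_2)\) and \(e_2=(v_1,v_2,v_3)\) from the simplicity bullet, plus one extra edge:

```python
# Made-up hypergraph used to illustrate degree and simplicity.
hyper_edges = [{"v1", "v2"}, {"v1", "v2", "v3"}, {"v3", "v4"}]

def degree(v, edges):
    """Degree of a vertex = number of hyper-edges containing it."""
    return sum(1 for e in edges if v in e)

def is_simple(edges):
    """Simple = no hyper-edge is a subset of another hyper-edge."""
    return not any(i != j and e1 <= e2
                   for i, e1 in enumerate(edges)
                   for j, e2 in enumerate(edges))

print(degree("v1", hyper_edges))  # 2
print(is_simple(hyper_edges))     # False: {v1, v2} is a subset of {v1, v2, v3}
```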

Examples of use

Recommender systems – for example, music recommendation such as in this paper. In the paper, they model the recommendation problem as a ranking problem on a unified hypergraph, that is, a hypergraph whose vertices and hyper-edges can be of different types (for example, some vertices represent users and some represent songs). They propose an algorithm that creates music recommendations using this type of hypergraph. The vertices are objects in a music social community, while the hyper-edges are relations among those objects; one such set of hyper-edges encodes similarity between tracks based on their acoustic signals.

Image retrieval – searching and retrieving images from a database of images. The following paper again uses a unified hypergraph, with images as the vertices and, as the different types of hyper-edges, different types of relations in individual modalities such as image content, user-generated tags and geo-locations.

Competition networks – a hypergraph can be used to indicate competition within the food chain. Different species are the vertices, and those in competition for a certain type of food (this food is usually another species) are contained within the same hyper-edge. This example is included in the Complex Networks as Hypergraphs paper linked in the references below.


References

Complex Networks as Hypergraphs – Ernesto Estrada, Juan A. Rodriguez-Velazquez

Voloshin, V. (2009). Introduction to graph and hypergraph theory. Hauppauge, N.Y.: Nova Science.

Graph Theory 101
Tue, 27 Apr 2021

If I mention a graph, most of us first think of a scatterplot or a bar chart or something we could create in Excel. However, in mathematics, when we speak about graphs and graph theory this usually refers to a structure used to represent relationships between objects.

So what does that mean exactly? Well a graph is an ordered pair \(G = (V,\, E)\) where

  • \(V\) – a set of vertices (aka nodes).
  • \(E\) – a set of edges, each edge connects 2 vertices (though it can be the same vertex twice if we allow loops).  So an edge can be written in the form \(e_1 = (v_1, v_2)\) where \(v_1,\,v_2\) are in \(V\).

For example,

We refer to the number of edges that are attached to a certain vertex (or incident to it) as the degree of that vertex. For example, in the graph above vertex 1 has degree 2 and vertex 2 has degree 1. If all vertices have the same degree, the graph is known as a regular graph.

We can introduce some more terminology and examples:

  • Loop – an edge that connects a vertex to itself (e.g \(e_1 = (v_1,\,v_1)\) for some vertex \(v_1\) in \(V\))
  • Multiple Edges – if two vertices have more than 1 edge connecting them (e.g \(e_1 = (v_1,\,v_2)\) and \(e_2 = (v_1,\,v_2)\) for some vertices \(v_1,\,v_2\) in \(V\))
  • Simple graph – a graph with no loops or multiple edges.
  • Path – a sequence of vertices in which each vertex in the path is connected by an edge to the next vertex in the path. If no vertices in a path are repeated this is called a simple path. A path has no repeated edges.
  • Cycle – a path that starts and ends at the same vertex and does not visit any other vertex more than once (apart from the start/end vertex).
  • Connected graph – a graph is connected if, for any two vertices, there exists a path between them.
  • Complete graph – a graph is complete if, for any two vertices, there exists an edge that connects them.

Matrix representation

We can also describe these graphs using matrices.

  • The incidence matrix is used to record which edges connect up which vertices. The columns of the matrix represent the edges and the rows represent the vertices. If a vertex is an endpoint of an edge then the matrix has a 1 in the corresponding position and a 0 otherwise.
  • The adjacency matrix is used to record which vertices are joined by an edge. Both the columns and rows represent the vertices. If two vertices are joined then the matrix will have 1 in the position that corresponds to that pair and a 0 if the vertices are not joined.
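Both matrices can be built from an edge list; here is a sketch for a hypothetical 3-vertex graph (labelled 0 to 2 here rather than 1 to 3):

```python
# Hypothetical simple graph on 3 vertices, as an edge list.
edges = [(0, 1), (0, 2)]
n = 3

# Adjacency matrix: rows and columns are vertices; symmetric for undirected graphs.
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[u][v] = 1
    adjacency[v][u] = 1

# Incidence matrix: rows are vertices, columns are edges.
incidence = [[1 if vertex in edge else 0 for edge in edges] for vertex in range(n)]

print(adjacency)  # [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print(incidence)  # [[1, 1], [1, 0], [0, 1]]
```

Note that summing a row of either matrix gives the degree of that vertex (vertex 0 has degree 2 here).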

Bipartite graph

A bipartite graph is a special type of graph where the vertices are split into two sets. No two vertices in the same set can be connected by an edge; instead, edges join a vertex from one set to a vertex from the other. A bipartite graph is 2-colourable: you can colour the vertices in such a way that no two vertices connected by an edge have the same colour, using only 2 distinct colours.
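The 2-colourability property gives a standard way to test whether a graph is bipartite; here is a minimal breadth-first-search sketch (my own example graphs, not from the post):

```python
from collections import deque

def is_bipartite(graph):
    """Try to 2-colour an adjacency-list graph; fail iff it is not bipartite."""
    colour = {}
    for start in graph:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]  # opposite colour to its neighbour
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False  # two adjacent vertices share a colour
    return True

print(is_bipartite({0: [2, 3], 1: [2], 2: [0, 1], 3: [0]}))   # True: sets {0,1} and {2,3}
print(is_bipartite({0: [1, 2], 1: [0, 2], 2: [0, 1]}))        # False: a triangle
```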

Directed graphs

We can add direction to the edges of a graph, which can be used to represent flow or one-way relationships. This means the order in which you list the vertices when writing an edge matters, i.e. in a directed graph \( e_1 = (v_1,\,v_2) \neq (v_2,\,v_1) = e_2 \).

While the adjacency matrix of an undirected graph is always symmetric, for a directed graph it is generally not. Rather than putting a 1 in both positions corresponding to a pair of vertices joined by an edge, for directed graphs if an edge goes from vertex \(i\) to vertex \(j\) we put a 1 in row \(i\), column \(j\). The entry in row \(j\), column \(i\) is 0 unless there is an additional edge going from vertex \(j\) to vertex \(i\).

Examples of use

As graph theory looks at the relationships between things, it can be very useful for displaying situations where we have these. One interesting example is social networks – we can create a graph of Facebook friends, with each vertex being a person’s profile and each edge representing that two profiles are friends with each other. In fact, graph theory is behind search engines such as Google, which uses weighted edges and an algorithm called PageRank. All webpages are the vertices and each hyperlink between webpages is an edge connecting them. Each page receives a rank based on the quality and number of the hyperlinks pointing to it. When a user makes a search, the engine matches the query with relevant pages and shows the pages with the highest rank first. For more information on how PageRank works check out the original paper in the references section below; I’ve also linked a great blog post that I found which explains how it works in more detail along with some other interesting information.
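As a rough illustration of the idea (a heavily simplified sketch on a hypothetical three-page web, not the full method from the paper), each page's rank can be spread over its outgoing links and iterated to convergence:

```python
# Hypothetical tiny web: each page maps to the pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
d = 0.85  # damping factor, as in the original PageRank formulation
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(100):  # power iteration until (approximate) convergence
    new_rank = {}
    for p in pages:
        # Rank flowing into p: each linking page q shares its rank equally
        # among its outgoing links.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - d) / len(pages) + d * incoming
    rank = new_rank

print(sorted(rank, key=rank.get, reverse=True))  # 'c' ends up with the highest rank
```

Page "c" wins here because it is linked to by both other pages, including all of "b"'s rank.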


References

Paper on Google’s PageRank system –

Google PageRank blog post

Which treatment would you prefer? Simpson’s Paradox
Fri, 16 Apr 2021

Let’s pretend we have collected some data. It’s from two separate groups but is measuring the same thing. We draw some conclusions from the data as a whole but then decide to look at the individual groups. When looking individually, we draw the exact opposite conclusion in each of the groups. Seems weird, right? Surely that can’t happen! Well it can, and this is known as Simpson’s paradox.

Simpson’s paradox is not limited to two groups of data, we can have many more and these groups could be things such as age groups, species, gender.

Kidney Stones

An interesting real example comes from a medical study which looked at different kidney stone treatments and how successful they were for different sizes of kidney stones. One of the treatments was a new, less invasive treatment and the other was the current treatment. The findings for the success of the two treatments are as below. The percentage is the success rate and in the brackets we have (number of successes / total cases).

When looking at the treatments for small stones and large stones individually, it is clear that the conclusion you would draw is that treatment A is more successful. However, when you combine these and look overall, you would think treatment B was more successful.

Here a larger proportion of those with small stones received treatment B, while a larger proportion of those with large stones received treatment A. If the proportions had been equal, we would not have seen this reversal. The size of the kidney stone has a greater influence on the success of a treatment than the choice of treatment does.
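The reversal can be checked directly using the success counts commonly cited from the 1986 kidney-stone study (Charig et al.); the numbers below are those published figures, not ones taken from this post:

```python
# (successes, total cases) per treatment and stone size,
# as commonly cited from Charig et al. (1986).
data = {
    "A": {"small": (81, 87),   "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within each stone-size group, treatment A has the higher success rate.
for size in ("small", "large"):
    a = rate(*data["A"][size])
    b = rate(*data["B"][size])
    print(size, "stones: A beats B?", a > b)  # True in both groups

# Aggregated over all stones, the comparison reverses.
a_all = rate(*(sum(x) for x in zip(*data["A"].values())))  # 273 / 350
b_all = rate(*(sum(x) for x in zip(*data["B"].values())))  # 289 / 350
print("overall: B beats A?", b_all > a_all)  # True: Simpson's paradox
```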

This brought to light that perhaps it is important to consider the size of kidney stones when testing treatments. It was not previously known to be an important consideration until they found this contradictory conclusion in the study.

It definitely highlights that you should not make overall conclusions without further analysis: there could be a number of factors being ignored when you make causal interpretations based on a summary. There are many other examples where Simpson’s paradox has been observed, such as comparing batting averages for baseball players over individual years vs all years together, and looking at gender bias in admissions to the University of California, Berkeley graduate school when split by department vs overall. Simpson’s paradox has even been seen within the COVID figures: looking at the case fatality rates for China and Italy, it appeared that Italy had a higher fatality rate, but when split by age demographic the opposite conclusion can be made. Check out this great video which discusses this.


Playing with emotions to increase film revenue
Thu, 08 Apr 2021

Is there a way that a screenwriter can choose to write their film to increase the chances of it being more profitable? A paper recently published in the Journal of the Operational Research Society (Del Vecchio et al. [2020]) looks at using the emotional arcs of films to drive product and service innovation in entertainment industries. Their hypothesis was that a particular type of emotional arc can be more profitable, perhaps in certain genres or just overall. The emotional arc is just a way of quantifying the highs and lows of a film, such as when there is a very sad scene or an extremely happy ending.

They began by extracting files containing subtitles and removing duplicates (choosing the most popular version where duplicates existed). They also attached information from IMDb, movie revenues and budgets to use within later analysis.

They split the subtitles into sentences and analysed each word in each sentence, giving it a “sentiment value” (1 if emotionally positive, 0 if emotionally neutral and -1 if emotionally negative) based on a lexicon developed in the Nebraska Literary Lab. From this, an overall sentiment value in \([-1,1]\) was found for each sentence. So that direct comparisons could be made between films of different lengths, a representative sub-sample of 100 of these sentiment values was taken, representing the percentage of time elapsed within a film. This set of 100 sentiment values through time is referred to as the emotional arc or trajectory of the film.
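The resampling step can be sketched as follows; this is my own simplified version of the idea (nearest-sentence sampling), not the paper's exact procedure:

```python
# Resample per-sentence sentiment scores to 100 points so films of
# different lengths can be compared on a common 0-100% time axis.
def emotional_arc(sentence_scores, points=100):
    n = len(sentence_scores)
    # For each percentage of the film, take the score of the nearest sentence.
    return [sentence_scores[min(n - 1, i * n // points)] for i in range(points)]

# Hypothetical film with 250 sentence scores in [-1, 1].
scores = [((i % 50) - 25) / 25 for i in range(250)]
arc = emotional_arc(scores)
print(len(arc))  # 100
```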

Previous work on a similar area had already found that novels can be partitioned into 6 clusters based on their emotional arcs:

  • Rags to Riches – an ongoing emotional rise. Examples in film are The Shawshank Redemption and The Nightmare Before Christmas.
  • Riches to Rags – an ongoing emotional fall. Examples in film are Monty Python and the Holy Grail and Toy Story 3 (this surprised me).
  • Man in a Hole – a fall followed by a rise. Examples in film are The Godfather and The Lord of the Rings: The Fellowship of the Ring.
  • Icarus – a rise followed by a fall. Examples in film are Mary Poppins and A Very Long Engagement.
  • Cinderella – a rise-fall-rise pattern. Examples in film are Babe and Spider-man 2.
  • Oedipus – fall-rise-fall pattern. Examples in film are The Little Mermaid and As Good as It Gets.

Within their paper they argued that, since many novels are made into films, films should follow the same 6 clusters as novels. Using this assumption of 6 clusters, they applied a k-means clustering approach to the sentiment data generated from the subtitles to partition their list of films into the aforementioned groups.

k-means Clustering

k-means clustering involves grouping data into \(k\) clusters based on each point’s distance to the cluster means. Here is a simple step-by-step overview of how a k-means clustering algorithm works.

  1. Choose \(k\) arbitrary points within the data. These will be the initial cluster centres.
  2. For each of the remaining points measure the distance to each of the \(k\) clusters and assign each point to the cluster it is nearest to.
  3. Find the mean of each cluster and, for all the points, find the distance from each of these means. Assign each point to the cluster whose mean the point is closest to. This is now the new set of clusters.
  4. Repeat the process in step 3 of finding the mean of each cluster and reassigning the points to the cluster whose mean it is closest to until the clusters do not change.
  5. As the result depends on the chosen starting points, steps 1 to 4 are repeated a chosen number of times.
  6. The variance for each cluster is measured each time and the final clustering is chosen to be the one which gives the smallest variance.
k-means clustering performed with 3 cluster centers (means)
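The steps above can be sketched in code; this is a minimal 1-dimensional version on made-up data, not the paper's implementation:

```python
import random

def k_means(data, k, restarts=10, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):                      # step 5: several restarts
        means = rng.sample(data, k)                # step 1: arbitrary starting points
        while True:
            # steps 2-3: assign each point to its nearest cluster mean
            clusters = [[] for _ in range(k)]
            for x in data:
                clusters[min(range(k), key=lambda j: abs(x - means[j]))].append(x)
            new_means = [sum(c) / len(c) if c else m
                         for c, m in zip(clusters, means)]
            if new_means == means:                 # step 4: stop when stable
                break
            means = new_means
        variance = sum((x - m) ** 2 for c, m in zip(clusters, means) for x in c)
        if best is None or variance < best[0]:     # step 6: keep lowest variance
            best = (variance, sorted(means))
    return best[1]

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
print(k_means(data, 3))  # means near 1, 5 and 9
```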

One thing that needs to be considered is how we choose the value of \(k\). Unfortunately there is no automatic way to do this, so usually the best approach is to try different values of \(k\). Starting with \(k=1\), run the k-means clustering algorithm and record the total variance for each \(k\). The variance begins as just the variance of the whole dataset for \(k=1\) and decreases as \(k\) increases. However, this decrease in variance will often slow down markedly after a certain value of \(k\), and this is the value of \(k\) you would then choose as your number of clusters. This method is known as the elbow method: if you plot the variance for each value of \(k\), there is often a distinct elbow-shaped bend at the “optimal” number of clusters.

Here this was explained for individual data points; however, k-means clustering can also be done on sets of points. Within the paper discussed here, the approach was to find the distance between two sets of points at each time point and minimise the total of this.

Use within the paper

For two trajectories compared across time percentage values \(t\), the distance was calculated by using Simpson’s rule to approximate the \(L_2\) metric, that is: $$ ||X_i(t) - X_j(t)|| = \sqrt{\frac{1}{\int \omega(t)\,dt} \int |X_i(t) - X_j(t)|^2\, \omega(t)\,dt} $$ with \(\omega(t)\equiv1\).
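This distance can be sketched numerically; the code below is my own interpretation, using composite Simpson's rule (which needs an odd number of equally spaced samples) and \(\omega(t)\equiv1\), so the normalising factor is just the length of the time interval:

```python
def simpson(values, dt):
    """Composite Simpson's rule over equally spaced samples (even # of sub-intervals)."""
    n = len(values) - 1
    assert n % 2 == 0
    s = values[0] + values[-1]
    s += 4 * sum(values[i] for i in range(1, n, 2))
    s += 2 * sum(values[i] for i in range(2, n, 2))
    return s * dt / 3

def arc_distance(x, y, dt=1.0):
    """Normalised L2 distance between two trajectories with weight w(t) = 1."""
    squared = [(a - b) ** 2 for a, b in zip(x, y)]
    length = dt * (len(x) - 1)  # integral of w(t) = 1 over the interval
    return (simpson(squared, dt) / length) ** 0.5

# Hypothetical arcs sampled at 101 time points, differing by a constant 0.5,
# so the normalised L2 distance should come out as exactly 0.5.
x = [0.5] * 101
y = [0.0] * 101
print(arc_distance(x, y))  # 0.5
```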

The justification given for using k-means clustering here was:

  • It is one of the most popular clustering techniques.
  • Allows the user to obtain meaningful intuition regarding the data structure.
  • Assumes spherical shapes of clusters which was found in previous natural language processing research.
  • More intuitive compared to other methods.

They tried various numbers of clusters (4, 6, 8, 10, 12) and found that the films did seem to fit optimally into 6 partitions, supporting their hypothesis. Below are graphs showing each of the clusters found.

A number of Ordinary Least Squares (OLS) regression models, one for each of the emotional trajectories, were used to determine whether differences in revenues were statistically significant. OLS regression involves fitting a straight line to the data which minimises the sum of the squared deviations (a.k.a. the errors or residuals) of each data point from the line. This is basically just finding the line of best fit.

Findings

Some of the interesting findings from the paper are:

  • While Man in a Hole films tended to have the highest revenues, they tended to have lower IMDb ratings than other arc types, leading to the belief that these films did well not necessarily because they were the most liked but rather because they were the most talked about, as they are often unusual and spark debate.
  • The financial success of Man in a Hole emotional arc films is not due to them falling within any particular budget category.
  • Generally a Riches to Rags emotional arc is the least financially successful one; however, when such films are in the high budget category they seem to generate statistically significantly higher revenue.
  • While Icarus films tend to be financially unsuccessful irrespective of the genre, again Man in a Hole films tend to generate high revenues across most genres.

There are a lot more in-depth findings included within the paper; for these, the whole methodology and the driving factors, see the references at the bottom of this page.

So it sounds to me that if you’re currently writing your next blockbuster, maybe consider taking the viewer to an emotional low followed by an emotional rise as in the Man in a Hole emotional arc, and maybe don’t spend too much money as it doesn’t seem to matter either way. While I think it is an interesting idea to plan a storyline around this emotional journey, if all films were made to follow the Man in a Hole pattern viewers could well become bored of it, so a recommendation to only ever do this is probably not smart.


References

Marco Del Vecchio, Alexander Kharlamov, Glenn Parry & Ganna Pogrebna (2020) Improving productivity in Hollywood with data science: Using emotional arcs of movies to drive product and service innovation in entertainment industries, Journal of the Operational Research Society, DOI: 

K-means clustering, 

Travelling Salesmen Problem… with a Drone
Fri, 26 Mar 2021

The travelling salesman problem is a pretty old problem: it was already being studied back in 1930 and continues to play a key role in operational research for things such as vehicle routing. The problem is as follows:

Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?

Swapping cities for various alternatives means this type of problem fits many different scenarios, and the modern day continues to add further complexities. One which I think is pretty interesting is the addition of a drone to a vehicle routing scenario. Rather than replacing the vehicle with the drone, the drone acts as a sort of sidekick (some papers even refer to it as such).

A drone alone can only carry lighter packages and has a finite battery life that determines its maximum flight time and distance. Therefore, instead of having the drone only deliver light packages that are close to the depot, the drone keeps meeting up with the delivery van en route to collect more packages (one package at a time) and replace batteries. This not only increases the range the drone can reach but means the drone can keep delivering while the van does too, greatly reducing the total delivery time.

The idea was first proposed in 2015 and there are currently various methods for optimally planning this. One such method, proposed by El-Adle et al., uses Mixed Integer Programming.

What is Mixed Integer Programming?

Mixed Integer Programming is a type of Linear Programming. It is used to find the “best” solution for a particular objective given a set of limitations known as constraints.

The goal is usually to find a solution which either minimises or maximises some linear objective function subject to the set of given linear constraints. The constraints can be a mixture of equalities and inequalities.

A common form of a linear program is: $$ \begin{align*} \max \,\,\, & c^T x \\ \text{subject to} \,\,\, & A x \leq b \\ & x\geq0 \end{align*} $$ where \(x\) is the vector of variables which vary to find the best solution (known as the decision variables), \(c\) is the vector of coefficients of the decision variables within the objective function, \(A\) is the matrix of coefficients of the decision variables within the constraints and \(b\) is the vector of bounds on the constraints.
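As a small made-up instance of this form (my own example, not one from the paper): $$ \begin{align*} \max \,\,\, & 3x_1 + 2x_2 \\ \text{subject to} \,\,\, & x_1 + x_2 \leq 4 \\ & x_1 + 3x_2 \leq 6 \\ & x_1,\,x_2 \geq 0 \end{align*} $$ here \(c = (3,\,2)^T\), \(A = \begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix}\) and \(b = (4,\,6)^T\). The optimal solution is \(x = (4,\,0)\) with objective value 12, found at a corner of the feasible region.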

The mixed integer element comes from adding constraints which ensure some (but not necessarily all) of the variables in the solution must be integers.

The usual method for solving Mixed Integer Programs is called Branch and Bound.

Branch and Bound

Finding the solution to a Linear Program where the decision variables don’t need to be integers is much simpler than when some or all do; there is a well-known algorithm, the revised simplex method, which can be used to do this. The Branch and Bound method uses relaxed versions of Integer or Mixed Integer Programs, where variables that are constrained to be integers are allowed to be non-integer, and iteratively moves towards integer solutions, eventually finding the optimal one. For a Mixed Integer Program (MIP), the steps are as follows:

  1. Start at node 1. Find the optimal solution to the relaxation of the MIP and calculate the objective function value at this optimum. If the variables which are constrained to be integers take integer values in this solution then you can stop here.
  2. Choose a variable which you want to be integer in the final solution but currently isn’t and branch on this. This means re-solving for the optimum twice but keeping this variable fixed in both instances. In one instance (node 2), the chosen variable is fixed at the “floor” of its value in the current optimal solution (i.e. rounded down to the nearest integer). In the other instance (node 3), the chosen variable is fixed at the “ceiling” of its value in the current optimal solution (i.e. rounded up to the nearest integer).
  3. Choose the solution with the highest objective value for the optimal solution at that node to branch a second time on. This keeps the variable which was set in step 2 fixed to the value at the node you branched on but repeats the process for another non-integer value in your current optimal at that node which you wish to be integer.
  4. Continue branching on the node which has the highest objective function value until you have a solution which fulfils the mixed integer constraints and all other nodes have an objective function value which is less than or equal to it.

Here is an example of the Branch and Bound algorithm being used. The \(x\) and OV stated at each node are the optimal solution and objective function value respectively at that stage in the Branch and Bound.
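The branching procedure can also be sketched in code. The example below is my own toy problem (a 0-1 knapsack, not the drone MIP): the relaxation is the fractional knapsack, solved greedily, and we branch on the first fractional variable by fixing it down to 0 or up to 1:

```python
# Toy branch and bound for a 0-1 knapsack (my own example, not the drone model).
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50

def relax(fixed):
    """Greedy fractional-knapsack relaxation with some variables fixed to 0 or 1."""
    x = dict(fixed)
    cap = capacity - sum(weights[i] for i, v in x.items() if v == 1)
    if cap < 0:
        return None, x  # fixed items already exceed capacity: infeasible
    total = sum(values[i] for i, v in x.items() if v == 1)
    free = sorted((i for i in range(len(values)) if i not in x),
                  key=lambda i: values[i] / weights[i], reverse=True)
    for i in free:
        take = min(1.0, cap / weights[i])  # may be fractional: this is the relaxation
        x[i] = take
        total += take * values[i]
        cap -= take * weights[i]
        if cap <= 0:
            break
    return total, x

def branch_and_bound():
    best_value, best_x = 0, {}
    nodes = [{}]  # each node = a set of variables fixed to 0 ("floor") or 1 ("ceiling")
    while nodes:
        fixed = nodes.pop()
        bound, x = relax(fixed)
        if bound is None or bound <= best_value:
            continue  # infeasible, or the bound cannot beat the incumbent: prune
        frac = [i for i, v in x.items() if 0 < v < 1]
        if not frac:  # all-integer solution: new incumbent
            best_value, best_x = bound, x
        else:         # branch on the first fractional variable
            nodes.append({**fixed, frac[0]: 0})
            nodes.append({**fixed, frac[0]: 1})
    return best_value, best_x

print(branch_and_bound()[0])  # 220: take the items worth 100 and 120
```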

While for simple problems you may want to solve these by hand, in reality, and especially for the travelling salesman problem with a drone, an optimisation solver such as Gurobi can do this much more efficiently for you.

How can the Travelling Salesman Problem with Drone be formulated?

A paper recently published in the Journal of the Operational Research Society by El-Adle et al. formulated the problem as a MIP with the elements described in this section.

Objective

Minimize the return time of both drone and vehicle to the depot.

Decision variables

Integer variables

For each of the following points below a set of binary decision variables are created to indicate:

  • If the vehicle travels from a location i to a location j.
  • If the drone travels from a location i to a location j.
  • If the drone flies from a location i to a location j.
  • If the vehicle visits a location j.

These are set to 1 if the event occurs and 0 if not. You’ll see that the second and third sets of decision variables appear to be the same; however, the drone may be transported from one location to another aboard the vehicle. In that case, for that pair of locations, the “travelling” variable would be set to 1 but the “flying” variable would be set to 0.

Non-integer variables

For each of the following points below a set of decision variables are created to indicate:

  • The arrival time of the delivery carrier at location j.
  • The departure time of the delivery carrier at location j.
  • The time spent by the vehicle waiting for the drone at location j.

Constraints

One or more constraints is used to ensure the following hold:

  • The vehicle departs the depot for exactly one location.
  • The drone departs the depot for exactly one location.
  • If the vehicle does not visit a location then the drone does.
  • If the drone travels from one location to another, then the vehicle must visit at least one of those.
  • If the drone travels from one location i to another location j and the vehicle also visits both those locations on its full journey then it must travel from i to j.
  • The drone cannot be transported aboard the vehicle at the same time that it flies.
  • If the drone “flies” then it also “travels”.
  • The drone cannot fly for a duration longer than its battery capacity allows before meeting with the vehicle.
  • If a carrier travels from one location to the next then it cannot arrive before the departure time from the last location plus the travel time to the new location.
  • A carrier cannot depart before it arrives.
  • Whenever the drone departs a location later than the vehicle, the vehicle’s waiting time is equal to the difference.

This makes for a much longer list of constraints and decision variables than the traditional Travelling Salesman Problem, as it is much more complex. However, with the way technology is moving forward, it definitely has its place. There are a number of other approaches to solving the Travelling Salesman Problem with Drone, and further adaptations too. For instance, adding multiple drones further complicates the problem but can speed delivery times up considerably.

For the paper where they detail the full Mixed Integer Program in mathematical terms and give a number of further model enhancements see the reference below.


References

Amro M. El-Adle, Ahmed Ghoniem & Mohamed Haouari (2021) Parcel delivery by vehicle and drone, Journal of the Operational Research Society, 72:2, 398-416, DOI: 

Wikipedia: Travelling salesman problem, 

Fraud in the 2020 US Election?!?!
Fri, 12 Mar 2021

We all know that both during and after the 2020 United States presidential election, Donald Trump threw a number of accusations of fraud and corruption Joe Biden’s way. One method that has been used in the past to detect fraud in elections is something known as Benford’s law. In 2009 it was used as evidence of fraud in the Iranian elections, and in 2020 it was found that Benford’s law did not hold for Biden’s votes, so many saw this as proof of fraud by Biden’s party… but was it really?

What is Benford’s law?

According to Benford’s law, if you have a large amount of real data you should expect the leading digit (first non-zero digit) to be 1 more often than any other digit; in fact, it should be 1 around 30% of the time. The probability of a leading digit \(d\) occurring decreases as \(d\) increases.

This chart shows the probability of a number having each of the possible leading digits.

Benford’s law can be used when analysing a large amount of random numerical data to detect various types of fraud. Interestingly, using Benford’s law it was found that the Greek government very likely reported fraudulent data to the EU in order to enter the eurozone, though this was only discovered later on.

In addition to analyzing the first digit, Benford’s law can also be applied to the second digit and the first and second digits together.
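As a quick aside, the first-digit probabilities in the chart above come from the standard statement of Benford’s law, \( \log_{10}(1 + 1/d) \); this is a sketch in Python rather than anything from the original post:

```python
import math

def benford_probability(d):
    """Probability that the leading (first non-zero) digit is d under Benford's law."""
    return math.log10(1 + 1 / d)

# Probabilities for each possible leading digit 1..9.
probs = {d: benford_probability(d) for d in range(1, 10)}
# They sum to 1, and a leading 1 occurs about 30% of the time.
```

Note that the nine probabilities telescope to \(\log_{10}(10) = 1\), which is a handy sanity check on any implementation.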

Why is Benford’s law true?

A simple way to see why Benford’s law holds is to think in terms of increasing earnings. If you’re earning £1,000 and want to increase this to £2,000, that is a 100% increase, so you’ll have a leading digit of 1 for a while. However, if you are earning an amount with a larger leading digit, such as £9,000, then getting to £10,000 is only an 11.1% increase, so you’ll pass through that digit much more quickly. This repeats at each order of magnitude (i.e. the power when a number is written in standard form, which is often just the number of digits) and leads to leading digits occurring in the proportions expected under Benford’s law.

Benford’s law only holds when the data spans multiple orders of magnitude. For instance, it does not hold if your data is the metric height of adults. This seems obvious when you consider that the majority of adults have a height with a leading digit of 1 or 2 (e.g. between 1m and 2.99m) and most likely no adult human has a height with a leading digit of 3 (a height of 30cm or 3m; I checked, and according to Google these have not occurred).

What about the 2020 US election?

When trying to prove fraud, multiple voting districts were singled out, such as Chicago and Milwaukee. In Chicago, votes were recorded for 50 wards, which are split into smaller areas known as precincts (2,069 precincts in total, though 15 of these had no votes so were excluded). The leading digit of the number of votes per candidate was recorded for each precinct, and the frequency of each possible leading digit can then be plotted against the frequency you would expect under Benford’s law. Here are the graphs for Biden and Trump.

You can see that the number of votes Trump received in each precinct follows Benford’s law relatively closely. For Biden, however, this is definitely not true, which led many to speculate that this was proof of fraud within the election.

A deeper look into the data makes it easy to see that there is definitely not a range of magnitudes. The minimum and maximum numbers of votes in a precinct were 39 and 1,418 respectively, and the majority of the precincts had vote counts with 3 digits. Therefore you would not expect Benford’s law to hold, and you can draw absolutely no conclusions from the fact that it does not for Biden.
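The per-precinct analysis described here reduces to tabulating leading digits and comparing the observed frequencies with Benford’s expected ones. A minimal sketch in Python; the vote counts below are made up for illustration (only the 39 and 1,418 extremes echo numbers mentioned above), not the real Chicago data:

```python
import math
from collections import Counter

def leading_digit(n):
    """First non-zero digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

def benford_comparison(counts):
    """Map each digit 1..9 to (observed fraction, Benford-expected fraction)."""
    observed = Counter(leading_digit(c) for c in counts)
    total = len(counts)
    return {d: (observed.get(d, 0) / total, math.log10(1 + 1 / d))
            for d in range(1, 10)}

# Hypothetical per-precinct vote counts:
votes = [39, 104, 230, 517, 1418, 876, 432, 198, 305, 121]
comparison = benford_comparison(votes)
```

With counts that are mostly 3-digit numbers, as in the real data, the observed fractions have no reason to track the expected ones.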

There continues to be a lot of debate as to whether using Benford’s law to detect fraud in election results is valid, though while it was misused in the 2020 US election, the 2009 Iranian election findings have not been contested. It seems logical that Benford’s law can usefully flag data that isn’t behaving as one would expect; however, this situation serves as a reminder to be cautious when making assumptions. It may also be that in other scenarios a deviation from Benford’s law reveals interesting patterns in behaviour rather than outright fraud.


References

Wikipedia, Benford’s law, 

Chicago data –

]]>
/stor-i-student-sites/katie-howgate/2021/03/12/fraud-in-the-2020-us-election/feed/ 1
How to use ggplot2 /stor-i-student-sites/katie-howgate/2021/02/27/how-to-use-ggplot2/?utm_source=rss&utm_medium=rss&utm_campaign=how-to-use-ggplot2 Sat, 27 Feb 2021 19:41:00 +0000 /stor-i-student-sites/katie-howgate/?p=509 Up until now my graph plotting skills in R have been severely lacking, usually consisting of a lot of googling until I find an example as similar as possible to what I want and amending it to fit my data and task. I thought I’d therefore teach myself by creating this post, and hopefully it is useful to others.

The package I’ll be looking at here is ggplot2, a data visualisation package that seems to be the main one used for creating many different types of graphs, with a good degree of customisation freedom and a professional look. After looking into the motivation behind each element of the code, ggplot2 is actually quite intuitive. My issues previously seemed to stem from the fact that I had never seen the code broken down into its components, and as a whole I found this package pretty confusing.

Creating a plot requires two stages to be completed. First you set up the plot to create a blank skeleton of a plot and then you add layers to this to add content and features to the plot such as data points and titles.

Set up the plot:

First we start with the ggplot() function and within this the most important arguments are

  • data – self explanatory, this is just the data that you are wanting to plot.
  • mapping – this will be in the form aes(x,y) where x will be used to scale the x-axis and y will be used to scale the y-axis.

Any plot created using the ggplot2 package must start with the ggplot() function, though you can leave all the arguments blank and specify the data and mapping in the layers. You might do this if you are using multiple dataframes to produce different graphics on the same plot. However, if you are only using a single dataframe and consistent mapping, then you specify them in the ggplot() function.

Adding layers:

We add layers to our ggplot() function using the + symbol. These layers add the actual content of the graph. Some examples of layers we can add are:

  • geom_point() – adding this to our ggplot without any arguments adds a scatter plot of the points from the data specified in the ggplot() function.
  • geom_smooth() – adding this without any arguments adds a line of best fit for the data given in the ggplot() step.

Each layer has its own set of possible arguments. Adding a layer without arguments means it will just use the data specified in the ggplot() part, but if you want a layer to use a different set of data, this is where you can specify it. This makes it easy to compare different variables or sets of data.

Let’s see a step by step build up of a plot:

I’m using the COVID19 package to access datasets relating to the pandemic; I thought it would be interesting to compare vaccination rates for a few countries. First I extracted the data for my chosen countries and, due to the data available, created the “Percentage Vaccinated” column from the number vaccinated and the population.

COVIDSubset <- covid19(country=c("United Kingdom",'US','Italy','Spain'))
COVIDSubset$PercentVaccinated <- 100*COVIDSubset$vaccines/COVIDSubset$population

Now I can begin creating the plot. First I start with the ggplot() basis with my data and chosen x and y variables. This creates an empty plot: there are no layers yet, so there is no data to display. The colour argument in aes() means that when I add layers they are split into categories based on the variable I put there and coloured accordingly. The software will automatically create a legend for me once layers are added, but this is missing for now. Here I have chosen to split the data by country using the id column in the data.

ggplot(data=COVIDSubset, aes(x=date,y=PercentVaccinated, colour= id))

Next I’ll add a single layer just to show how the plot builds up. Given the data I want to look at, I have chosen the geom_line() layer.

ggplot(data=COVIDSubset, aes(x=date,y=PercentVaccinated, colour= id)) + geom_line()

Due to missing data causing breaks in the line, I’ve decided to swap geom_line() for geom_smooth() to get a line of best fit. Now I’ll add more layers, these are:

  • labs() for axis labels and the legend label (the latter is set using the colour argument).
  • xlim() to reduce the range of the x-axis to a more suitable range.
  • scale_colour_discrete() to rename the labels for the legend.
  • ggtitle() to set a title.
  • theme() with the plot.title = element_text(hjust = 0.5) argument to centre the title, as the default has it left-aligned.
ggplot(data=COVIDSubset, aes(x=date,y=PercentVaccinated, colour= id)) + 
  geom_smooth() + 
  labs(x="Month", y="Percentage vaccinated (%)", colour="Country") + 
  xlim(as.Date(c("2020-12-01", "2021-03-16"))) +
  scale_colour_discrete(labels = c("Spain", "UK", "Italy","USA")) + 
  ggtitle("Percentage of Total Population Vaccinated by Country") +
  theme(plot.title = element_text(hjust = 0.5))

Looks like we are doing quite well in the vaccine game in comparison to Spain and Italy, with the US doing alright but not quite as well. This plot was nice and easy to make and to tweak. There are many interesting styles of plot that can be created using the ggplot2 package beyond line graphs, but this seemed like a good start. The ggplot2 link in the references below has an extremely useful cheatsheet which details most of the possible layers you can add and the options within them. The key idea is to remember it is all about adding layers and building the graph up: think of what you want to see graphically, then break that down into the steps needed to build it.


References

ggplot2 –

COVID19: R Interface to COVID-19 Data Hub –

]]>
A Bayesian Approach To Finding Lost Objects /stor-i-student-sites/katie-howgate/2021/02/08/a-bayesian-approach-to-finding-lost-objects/?utm_source=rss&utm_medium=rss&utm_campaign=a-bayesian-approach-to-finding-lost-objects /stor-i-student-sites/katie-howgate/2021/02/08/a-bayesian-approach-to-finding-lost-objects/#comments Mon, 08 Feb 2021 19:03:00 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/katie-howgate/?p=377 Ever wondered how lost objects are found in the sea? For example, when a plane such as Air France Flight 447 goes missing, there must be some approach to search efficiently. While there is a small hope of finding survivors of a plane crash, it is very useful to find the black box so the cause of the crash can be determined and, if possible, the same failure prevented in the future. In cases such as this, something known as Bayesian Search Theory is used.

What is Bayesian Search Theory?

Bayesian Search Theory applies Bayesian statistics to a search problem, making it more efficient than just randomly searching all of the possibilities. It allows your view of where you are most likely to find an object to be updated as you progress through the search. It has been used to find various lost sea vessels such as the USS Scorpion, to help recover the flight recorders of Air France Flight 447, and in attempts to locate the remains of Malaysia Airlines Flight 370.

USS Scorpion

How is it applied?

  1. Formulate as many possible scenarios of what could have happened to the missing object using knowledge such as the last known position and the time it was lost. This will also involve a weighting for which scenarios are more likely.
  2. For each scenario, assign a probability to the search space of the lost object being in each possible place.
  3. An additional probability is assigned based on the likelihood of finding the object in a possible place, given it actually is there.
  4. Combining these probabilities gives the probability of the object actually being found if the search takes place in a certain area. The areas with the highest probability of finding the object are searched first.
  5. As the search progresses, the findings (or more appropriately the lack thereof) are used to update the probability that an object will be found in certain area. This is done using Bayes Theorem.
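The steps above can be sketched in a few lines of Python; this is a toy version with made-up grids (not the lake example that follows), and the function names are my own:

```python
def combined(p, q):
    """Probability of actually finding the object in each cell: p[i] * q[i],
    where p[i] is 'object is in cell i' and q[i] is 'found there, given it is'."""
    return [pi * qi for pi, qi in zip(p, q)]

def update_after_failed_search(p, q, s):
    """Bayes update of the 'is there' probabilities after an unsuccessful
    search of cell s: the searched cell is down-weighted by (1 - q[s]),
    and every cell is renormalised by 1 - p[s] * q[s]."""
    norm = 1 - p[s] * q[s]
    return [pi * (1 - q[s]) / norm if i == s else pi / norm
            for i, pi in enumerate(p)]

# Toy 4-cell search space (illustrative numbers only):
p = [0.1, 0.2, 0.4, 0.3]   # probability the object is in each cell
q = [0.9, 0.5, 0.8, 0.4]   # probability of finding it there, given it is

best = max(range(4), key=lambda i: combined(p, q)[i])   # search the best cell first
p = update_after_failed_search(p, q, best)              # didn't find it, so update
```

After the update the searched cell’s probability drops while every unsearched cell’s rises, and the grid still sums to 1, matching the intuition in the worked example below.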

An example

Let’s pretend I dropped my phone in a small, perfectly square lake (I don’t think these exist, but it makes the example much simpler). I remember last using it when I was on the boat near the centre of the lake, and I am certain I must have dropped it when trying to put it back into my pocket. I did drift towards the left edge of the lake after using my phone, but I think it’s more probable that it was dropped towards the centre. I’m keeping it simple here with a single possible scenario.

I’ve split the lake up into a 4×4 grid so I’m looking to find which square on the grid to search first. From my assumed scenario of how my phone was lost, I’ve created a probability density map of the space I am searching for the probability my phone has been dropped in a certain square. I’ve also looked into the depth of the lake to create a probability map of how likely I am to actually find my phone in that square given it is there.

Probability density map for where I dropped my phone (based on my knowledge of where I dropped it.)

Probability map for how likely I am to find my phone in each square given it is there (based on the hypothetical depth of the lake in each square.)
Combining these gives
Total probability of finding my phone in each square of the grid.

Now I have a map of where to start searching, it looks like the square on the second row up and third column along would be the best option so I’ll search there first.

Unfortunately my phone isn’t there, so I’ll use Bayes’ theorem to update the probability in that square. Applying Bayes’ theorem and a bit of jiggery pokery with probability identities gives:

\( P[\text{Is There} \mid \text{Not Found}] = \frac{P[\text{Is There}]\,(1 - P[\text{Found} \mid \text{Is There}])}{1 - P[\text{Is There}]\,P[\text{Found} \mid \text{Is There}]} \)

where P[Is There] and P[Found | Is There] are the probabilities in our two grids originally created. So I can update the probability for that square (rounded to 2 decimal places):

Therefore I update this square to have a probability of 0.03 that my phone is there. I also revise the probabilities for all the other squares given that the phone wasn’t found in that first square. For these squares the calculation is:

\( P[\text{Is There} \mid \text{Not Found}] = \frac{P[\text{Is There}]}{1 - P[\text{Is There}_s]\,P[\text{Found}_s \mid \text{Is There}_s]} \)

where the subscript \(s\) denotes the square that was just searched.
This is then combined with the probability of finding the phone given it is there so I have a revised probability map for the most likely places you will find the actually phone:

Revised probability that my phone is in a certain square.
Combining this with the probability of finding the phone given it is in that square.
Total probability of finding my phone in each square of the grid.

From this I would then move on to search the square with probability of finding the phone of 0.13 and repeat this process until the phone is found.

Practically, always going for the square with the maximum probability may not be the most efficient method, because you may have to travel across the whole area, and it might be better to search areas you are travelling through as you go. After the initial probabilities have been found, a search plan will likely be created for your planned search journey; this may not necessarily have you searching the areas in descending probability order. However, with Bayesian search theory applied, you get a useful indication of where it would be wise to search and when to amend your search plan.



]]>
/stor-i-student-sites/katie-howgate/2021/02/08/a-bayesian-approach-to-finding-lost-objects/feed/ 2
How to build a credit scorecard /stor-i-student-sites/katie-howgate/2021/02/07/how-to-build-a-credit-scorecard/?utm_source=rss&utm_medium=rss&utm_campaign=how-to-build-a-credit-scorecard Sun, 07 Feb 2021 11:18:00 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/katie-howgate/?p=378 I thought I’d write here about one of the things in my job that led me to start thinking about statistics again, and therefore to STOR-i. As a credit risk analyst, a large part of my job involved monitoring and adapting models for clients which incorporated scorecard models. I never actually got to build a real model myself, but as a fun training exercise we had a model building competition.

What is a credit scorecard?

Within industry, credit scorecards are used to assign a score to an individual which gives you a gauge of their predicted riskiness. This riskiness is based on predicting the probability of default, which can vary in definition but in general will be a chosen set of criteria indicating that a customer is unlikely to pay. Scorecards are very useful as it’s simple to explain how to use them to a non-technical audience, which is often key when boards of directors need to understand why decisions have been made, or individuals wonder why their application for a loan has been declined.

A scorecard may generally look like the following:

To use this scorecard you’d go through each characteristic, assign the correct score for that individual, and add these up to get a total score. Just to note, the scorecard above is extremely simple and was created on the fly by me, so don’t take it to have any actual predictive power. A properly built scorecard would contain many more variables and may use some scaling which takes the raw score and transforms it to get a final score. A common scaling is multiplying by 20/log(2) and adding 500, as this multiplier leads to a final score where an additional 20 points doubles the odds (the odds being the probability being predicted divided by 1 minus that probability).
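That scaling can be written down in a couple of lines; a sketch in Python, where the 500 offset and 20 points-to-double-odds come from the paragraph above and the function name is my own:

```python
import math

PDO = 20      # "points to double the odds"
OFFSET = 500

def scaled_score(odds):
    """Transform raw odds (p / (1 - p)) into a scorecard score in which
    every extra 20 points doubles the odds."""
    return OFFSET + (PDO / math.log(2)) * math.log(odds)
```

Because the multiplier is 20/ln(2), doubling the odds adds exactly (20/ln 2) × ln 2 = 20 points, which is the whole point of that particular choice of scaling.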

Types of credit scorecards

There are two types of credit scorecards:

  • Application scorecards. These predict the probability an applicant will behave in a certain way, based on data you would gather at application (for a loan, for example). They are used in the decision to accept or reject the application, but may also be used during the lifetime of the loan to judge whether there has been a significant increase in risk since application.

  • Behavioural scorecards. These are scorecards that are based on data gathered through the lifetime of the loan or mortgage. This can be things such as paying habits or changes in circumstances. In particular, I saw these used to predict the probability of default and the probability of redemption which were used as part of a calculation of expected losses (IFRS9 and IRB).

How can you build a credit scorecard model?

There are different methods that can be used to build a scorecard but I’ll talk you through the method I’m most familiar with:

Step one: Gather and clean your data.

If you are working with real-life data, you can have erroneous data which you want to remove or change prior to building a model with it. For example, if a default date was set whenever the data was missing, you would have to deal with this appropriately if that date was going to be used in a variable in the model (perhaps a “time since X happened” variable). Your data pull needs to include data from an observation point in time (this is the data you will build the model on) and data to determine an outcome (for probability of default models this is whether a default occurred within the outcome period or not).

Step two: Create any new variables

You might want to create some new variables which combine two characteristics or represent a history rather than just point-in-time data. This would also include deriving an outcome for each observation (if this wasn’t easily found in the first step).

These outcomes are often referred to as:

  • Goods – if the event you are predicting didn’t occur.
  • Bads – if the event did occur (e.g. a default).
  • Indeterminates – if you couldn’t determine whether the event occurred or not.

Step three: Split the data

Randomly split the data so you have a development sample to build the model with and a testing sample which you can use to test the model once it is built. A common split is 80% development and 20% testing.

Step four: Fine classing

Put your possible model variables into an initial set of bins. You want to keep this quite granular at this stage so you might have a large number of bins (perhaps up to 20). For example you could split a variable like property age into 5 yearly splits, so you’d have 0-5, 5-10 and so on with a bin at the end for anything exceeding your final bin value and another extra bin for missing values.
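A fine-classing helper for the property-age example might look like the following; a Python sketch in which the bin width, overflow cut-off and missing-value convention are my own illustrative choices:

```python
def fine_class(age, width=5, n_bins=20):
    """Assign a value to a fine-classing bin label: fixed-width bins,
    an overflow bin past the last edge, and a separate bin for missing values."""
    if age is None:
        return "missing"
    limit = width * n_bins
    if age >= limit:
        return f"{limit}+"
    lower = (int(age) // width) * width
    return f"{lower}-{lower + width}"
```

So a 3-year-old property falls in "0-5", a 7-year-old one in "5-10", and anything beyond the last edge lands in the overflow bin.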

Step five: Calculate WoE and IV

For each bin of each characteristic you need to calculate a metric known as the Weight of Evidence (WoE). This is the natural log of the percentage of goods that fall into the bin divided by the percentage of bads that fall into the bin. It is used in the next step, but it is also used to calculate the Information Value (IV) for each characteristic, which sums the difference between the percentage of goods and bads multiplied by the WoE over all bins. Here are the formulas:

\( \text{WoE}_i = \ln\left(\frac{\%\,\text{goods}_i}{\%\,\text{bads}_i}\right), \qquad \text{IV} = \sum_i \left(\%\,\text{goods}_i - \%\,\text{bads}_i\right) \times \text{WoE}_i \)
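The WoE and IV calculations described above are straightforward to code up; a Python sketch with hypothetical bin-level counts of goods and bads:

```python
import math

def woe_and_iv(bins):
    """bins: list of (n_goods, n_bads) per bin for one characteristic.
    Returns the WoE for each bin and the characteristic's Information Value."""
    total_goods = sum(g for g, b in bins)
    total_bads = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        pct_goods = g / total_goods
        pct_bads = b / total_bads
        woe = math.log(pct_goods / pct_bads)   # natural log of %goods / %bads
        woes.append(woe)
        iv += (pct_goods - pct_bads) * woe
    return woes, iv

# Two hypothetical bins: one good-heavy, one bad-heavy.
woes, iv = woe_and_iv([(90, 10), (10, 10)])
```

A bin with more than its share of goods gets a positive WoE, a bad-heavy bin a negative one, and the IV (which is always non-negative, since each term pairs a difference with its own log ratio) summarises the characteristic’s predictive power.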

Step six: Coarse classing

Based on similar WoE values we can group bins together to create larger, less granular groups for each characteristic. At this stage we can also remove some variables from consideration based on their IV, as a low Information Value indicates low predictive power. It is good to keep recalculating the WoE and IV as you group bins together, to keep checking you are making the right decisions.

Step seven: Choosing a dummy variable or WoE approach

Both approaches will yield similar results, but in certain situations one will perform better than the other. For example, the weight of evidence approach is good when you have a lot of categorical variables, while the dummy variable approach is simpler.

Dummy

This involves splitting your coarse-classed variables up so each bin has its own binary dummy variable, which takes the value 1 if an individual falls into that bin for the characteristic and 0 if not. In this case a different coefficient is assigned to each dummy variable.

Weight of Evidence

This keeps each characteristic as a single variable, but for an individual the variable takes the WoE value corresponding to the bin they fall into. In this case a single coefficient is assigned to each variable.

Step eight: Logistic regression

Now it’s time to build the model (Yey!). You can do this using programs such as SAS, R and Python. Logistic regression is used to model probabilities of events with 2 possible outcomes and gives you a predictive model which looks like the following:

\( \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \)

Here the left-hand side of the equation is the log odds, where p is the probability you are trying to predict.
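Once the coefficients are fitted, inverting the log odds gives the predicted probability; a generic Python sketch, not tied to any particular fitting package (the coefficient values are illustrative only):

```python
import math

def predict_probability(coefficients, intercept, x):
    """Invert the logistic regression equation: the linear predictor is the
    log odds ln(p / (1 - p)), so p = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1 / (1 + math.exp(-log_odds))
```

A log odds of 0 corresponds to p = 0.5, and a more negative log odds pushes the predicted probability of the event towards 0.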

Step nine: Check model against test data

Once you’ve built your model you can apply it to the test sample you kept aside in step three and see how well it is performing.

Step ten: Convert it to a scorecard format

If you used the dummy variable approach, the score assigned to each bin of each characteristic is just the coefficient of the corresponding dummy variable. If you went with the WoE approach, you need to multiply the coefficient of the characteristic’s variable by the WoE of the bin you want the score for.

Scorecard monitoring

There are a number of standard metrics used to monitor the performance of a scorecard. A main one is (as one might expect) comparison of the actual rates versus the predicted rates; this checks the model is accurately predicting what it should, and if it is not, a basic calibration can be performed to bring these more into line. This typically involves a gradient and intercept shift to the distribution of scores, though more complex methods can be used. Another widely used metric is the Gini coefficient, which measures the discriminatory power of the model. This looks at the distribution of goods vs bads and will be higher if more of the goods have higher scores and more of the bads have lower scores (so a Gini coefficient of 100% would be perfect discrimination, though not necessarily a perfect model).

]]>
Bootstrapping /stor-i-student-sites/katie-howgate/2021/02/05/bootstrapping/?utm_source=rss&utm_medium=rss&utm_campaign=bootstrapping Fri, 05 Feb 2021 18:24:00 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/katie-howgate/?p=398 Imagine you have a sample of data and wish to calculate the mean. You can easily calculate the sample mean, but how do you know whether it is indicative of the population mean? In a perfect world you could gather a much larger sample covering the entire population and calculate this accurately; however, in a lot of cases this is not possible and the original sample is all you’ve got. In this case we can use a technique known as bootstrapping.

What is bootstrapping?

Bootstrapping means repeatedly taking a random sample from your available sample, with replacement, so the same data point may be sampled many times. On each iteration you select a random sample of a chosen size (often the same size as your original sample), and you repeat this for many iterations. You can then calculate parameter estimates for each resample and use their distribution to create a confidence interval. Bootstrapping is useful for non-parametric distributions where these would otherwise be difficult to calculate.

Advantages and Disadvantages

Obviously bootstrapping is incredibly simple: it doesn’t involve any complex calculations or test statistics. You can also use bootstrapping on much smaller sample sizes than traditional statistical methods allow (even samples as small as 10 in some cases).

A key advantage is that bootstrapping doesn’t require you to make any assumptions about the data (such as normality); regardless of the distribution of the data you bootstrap the same way, and all you are using is the information you actually have. With traditional methods, if your assumptions are at all wrong, the estimates you get can be extremely far from the true values, so bootstrapping can be more reliable in this case. That being said, not all distributions give accurate confidence intervals for parameter estimates when bootstrapping. Examples of situations where bootstrapping may fail are data with high correlation, or Cauchy distributions (which have no mean anyway).

Testing bootstrapping?

Here I’ve randomly generated 50 values from a standard normal distribution to represent a sample I might take. I know it should have a population mean of zero and standard deviation of 1, but if I use the bootstrapping approach will I get a good confidence interval for these?

k <- 50
set.seed(666)
sample <- rnorm(k, 0, 1)   # the "observed" sample of 50 standard normal values

boot_sample_mean <- NULL
boot_sample_sd <- NULL
n <- 10000                 # number of bootstrap resamples
for (j in 1:n){
  boot_sample <- NULL
  for (i in 1:k){
    # draw a random index in 1..k, i.e. resample with replacement
    # (the variable name `sample` masks base R's sample() function,
    # which is why runif() is used to pick indices instead)
    x <- floor(runif(1, 1, k + 1))
    boot_sample <- c(boot_sample, sample[x])
  }
  boot_sample_mean <- c(boot_sample_mean, mean(boot_sample))
  boot_sample_sd <- c(boot_sample_sd, sd(boot_sample))
}

Using the code above I’ve created 10,000 new random samples of my original data and stored the mean and standard deviation of each. The 95% confidence intervals I got from this data:

Mean

Bootstrap estimation gives mean of -0.065 with a confidence interval (-0.371, 0.248)

Standard Deviation

Bootstrap estimation gives standard deviation of 1.117 with a confidence interval (0.936, 1.298)

Both confidence intervals contain the true values of the parameters they estimate, though the histogram of standard deviations in particular is shifted a little to the right, meaning our generated samples perhaps tended to have more variability than the true distribution.

]]>