# Outbreak Detection in Networks

## Introduction

The general goal of outbreak detection in networks is the following: given a dynamic process spreading over a network, select a set of nodes to detect the process as efficiently as possible. Outbreak detection in networks has many applications in real life. For example, where should we place sensors to quickly detect contaminations in a water distribution network? Whom should we follow on Twitter to avoid missing important stories?

The following figure shows the different effects of placing sensors at two different locations in a network:

*(A) The given network. (B) An outbreak starts and spreads as shown. (C) Placing a sensor at the blue position saves more people and detects the outbreak earlier than placing a sensor at the green position, though it costs more.*

## Problem Setup

The outbreak detection problem is defined as follows:

- Given: a graph $G(V, E)$ and data on how outbreaks spread over this graph (for each outbreak $i$, we know the time $T(u, i)$ when the outbreak contaminates node $u$).
- Goal: select a subset of nodes $S \subseteq V$ that maximizes the expected reward:

$$\max_{S \subseteq V} f(S) = \sum_i P(i) \cdot f_i(S), \quad \text{subject to } \operatorname{cost}(S) \leq B$$

where

- $P(i)$: probability of outbreak $i$ occurring
- $f_i(S)$: reward for detecting outbreak $i$ using “sensors” $S$. It is obvious that $\sum_i P(i) \cdot f_i(S)$ is the expected reward for detecting the outbreak
- $B$: total budget of placing “sensors”

**The Reward** can be one of the following three:

- Minimize the time to detection
- Maximize the number of detected propagations
- Minimize the number of infected people

**The Cost** is context-dependent. Examples are:

- Reading big blogs is more time consuming
- Placing a sensor in a remote location is more expensive

## Outbreak Detection Formalization

### Objective Function for Sensor Placements

Define the **penalty $\pi_i(t)$** for detecting outbreak $i$ at time $t$, which can be one of the following. Notice: in all three cases, detecting sooner does not hurt! Formally, this means that in all three cases $\pi_i(t)$ is monotonically nondecreasing in $t$.

**Time to Detection (DT)**: How long does it take to detect an outbreak?

- Penalty for detecting at time $t$: $\pi_i(t) = t$

**Detection Likelihood (DL)**: How many outbreaks do we detect?

- Penalty for detecting at time $t$: $\pi_i(t) = 0$ if $t < \infty$, and $\pi_i(t) = 1$ if $t = \infty$. This is a binary outcome: $\pi_i(t) = 0$ means we detect the outbreak and pay 0 penalty, while $\pi_i(t) = 1$ means we fail to detect the outbreak and pay 1 penalty. That is, we do not incur any penalty if we detect the outbreak in finite time; otherwise we incur penalty 1.

**Population Affected (PA)**: How many people/nodes get infected during an outbreak?

- Penalty for detecting at time $t$: $\pi_i(t)$ = the number of infected nodes in outbreak $i$ by time $t$

The objective **reward function of a sensor placement $S$** is defined as the penalty reduction:

$$f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$$

where $T(S, i)$ is the time when the set of “sensors” $S$ detects outbreak $i$.
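To make these definitions concrete, here is a minimal Python sketch (an illustration, not part of the original notes) that represents a single outbreak as a dict mapping each contaminated node to its contamination time, and implements the three penalties and the penalty-reduction reward. Capping the DT penalty at the last contamination time, so that $\pi_i(\infty)$ stays finite, is an assumption of this sketch.

```python
import math

def detection_time(S, outbreak):
    """T(S, i): the earliest time at which any sensor in S is contaminated."""
    times = [outbreak[u] for u in S if u in outbreak]
    return min(times) if times else math.inf

def penalty_dt(t, outbreak):
    """Time to Detection: pi_i(t) = t, capped at the last contamination time."""
    return min(t, max(outbreak.values()))

def penalty_dl(t, outbreak):
    """Detection Likelihood: 0 if detected in finite time, 1 otherwise."""
    return 0 if t < math.inf else 1

def penalty_pa(t, outbreak):
    """Population Affected: number of nodes infected by time t."""
    return sum(1 for time in outbreak.values() if time <= t)

def reward(S, outbreak, penalty):
    """f_i(S) = pi_i(infinity) - pi_i(T(S, i)): the penalty reduction."""
    t = detection_time(S, outbreak)
    return penalty(math.inf, outbreak) - penalty(t, outbreak)
```

For example, with `outbreak = {'a': 0, 'b': 1, 'c': 2, 'd': 3}`, placing a single sensor at `'c'` gives a PA reward of 1: detection happens at time 2, when three of the four nodes are already infected.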

### Claim 1: $f(S)$ is monotone

(For the definition of monotone, see Influence Maximization.)

Firstly, we do not reduce any penalty if we do not place any sensors. Therefore, $f_i(\emptyset) = \pi_i(\infty) - \pi_i(\infty) = 0$ and $f(\emptyset) = 0$.

Secondly, for all $A \subseteq B \subseteq V$ ($V$ is the set of all the nodes in $G$), $T(A, i) \geq T(B, i)$, and

$$f_i(A) = \pi_i(\infty) - \pi_i(T(A, i)) \leq \pi_i(\infty) - \pi_i(T(B, i)) = f_i(B)$$

Because $\pi_i(t)$ is monotonically nondecreasing in $t$, $\pi_i(T(A, i)) \geq \pi_i(T(B, i))$, which gives the inequality above. Therefore, $f_i(S)$ is nondecreasing. It is obvious that $f(S)$ is also nondecreasing, since $f(S) = \sum_i P(i) \cdot f_i(S)$ and $P(i) \geq 0$.

Hence, $f(S)$ is monotone.

### Claim 2: $f(S)$ is submodular

(For the definition of submodular, see Influence Maximization.)

We want to prove that for all $A \subseteq B \subseteq V$ and $s \in V \setminus B$:

$$f_i(A \cup \{s\}) - f_i(A) \geq f_i(B \cup \{s\}) - f_i(B)$$

There are three cases based on the time $T(s, i)$ at which sensor $s$ detects outbreak $i$ (note $T(B, i) \leq T(A, i)$, since $A \subseteq B$):

- $T(s, i) \geq T(A, i)$ ($s$ detects late): nobody benefits. That is, $f_i(A \cup \{s\}) = f_i(A)$ and $f_i(B \cup \{s\}) = f_i(B)$. Therefore, $f_i(A \cup \{s\}) - f_i(A) = 0 = f_i(B \cup \{s\}) - f_i(B)$.
- $T(B, i) \leq T(s, i) < T(A, i)$ ($s$ detects after $B$ but before $A$): $s$ only helps to improve the solution of $A$ but not $B$. Therefore, $f_i(A \cup \{s\}) - f_i(A) \geq 0 = f_i(B \cup \{s\}) - f_i(B)$.
- $T(s, i) < T(B, i)$ ($s$ detects early): $f_i(A \cup \{s\}) - f_i(A) = \left[\pi_i(\infty) - \pi_i(T(s, i))\right] - f_i(A) \geq \left[\pi_i(\infty) - \pi_i(T(s, i))\right] - f_i(B) = f_i(B \cup \{s\}) - f_i(B)$. The inequality is due to $f_i(A) \leq f_i(B)$ (see Claim 1).

Therefore, $f_i(S)$ is submodular. Because $f(S) = \sum_i P(i) \cdot f_i(S)$ with $P(i) \geq 0$, $f(S)$ is also submodular. (Fact: a non-negative linear combination of submodular functions is a submodular function.)
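Both claims can be sanity-checked numerically on a toy instance. The sketch below (illustrative; the node names and contamination times are made up) uses the Population-Affected reward for a single outbreak and brute-forces the monotonicity and submodularity inequalities over all nested pairs of sensor sets:

```python
import math
from itertools import combinations

# Toy outbreak: node -> contamination time.
outbreak = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
V = sorted(outbreak)

def f(S):
    """Population-Affected reward: number of nodes infected after T(S)."""
    t = min((outbreak[u] for u in S), default=math.inf)
    return sum(1 for time in outbreak.values() if time > t)

def subsets(xs):
    for r in range(len(xs) + 1):
        for c in combinations(xs, r):
            yield set(c)

for B in subsets(V):
    for A in subsets(sorted(B)):
        assert f(A) <= f(B)  # Claim 1: monotone
        for s in set(V) - B:
            # Claim 2: diminishing returns of adding s
            assert f(A | {s}) - f(A) >= f(B | {s}) - f(B)
```

Exhaustive checking is only feasible on tiny instances, but it catches notation mistakes quickly when experimenting with other penalty functions.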

We know that the Hill Climbing algorithm works for optimization problems with nondecreasing submodular objectives. However, it does not work well for this problem:

- Hill Climbing only works for the case where every sensor has the same cost. In this problem, each sensor $s$ has its own cost $c(s)$.
- Hill Climbing is also slow: at each iteration, we need to re-evaluate the marginal gains of all nodes. The run time is $O(|V| \cdot k)$ for placing $k$ sensors.
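For reference, unit-cost Hill Climbing can be sketched as follows (a generic illustration; `f` is any set function supplied as a Python callable):

```python
def hill_climbing(f, V, k):
    """Greedily add the element with the largest marginal gain, k times.

    Every round re-evaluates the marginal gain of each remaining element,
    which is where the O(|V| * k) evaluation count comes from.
    """
    S = set()
    for _ in range(k):
        candidates = [u for u in V if u not in S]
        if not candidates:
            break
        best = max(candidates, key=lambda u: f(S | {u}) - f(S))
        S.add(best)
    return S
```

For instance, with a coverage objective (the reward of a placement is the number of distinct items its sets cover), two rounds pick two sets covering the most items.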

Hence, we need a new fast algorithm that can handle cost constraints.

## CELF: Algorithm for Optimizing Submodular Functions Under Cost Constraints

### Bad Algorithm 1: Hill Climbing that ignores the cost

**Algorithm**

- Ignore sensor cost
- Repeatedly select sensor with highest marginal gain
- Do this until the budget is exhausted

**This can fail arbitrarily badly!** Example:

- Given $n$ sensors and a budget $B$
- $s_1$: reward $r$, cost $B$
- $s_2, \ldots, s_n$: reward $r - \epsilon$, cost $\epsilon$ ($\epsilon$ is an arbitrarily small positive number)
- Hill Climbing always prefers $s_1$ to the other cheaper sensors, resulting in a solution with reward $r$ instead of the optimal solution with reward $(n - 1)(r - \epsilon)$, which is arbitrarily worse as $n \to \infty$ (taking $\epsilon$ small enough that the budget covers $s_2, \ldots, s_n$).
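The failure is easy to reproduce. In the sketch below (illustrative; rewards are assumed additive and the numbers follow the example above), the greedy pass picks only the expensive sensor and leaves the far more valuable cheap sensors on the table:

```python
def greedy_ignore_cost(sensors, budget):
    """Bad Algorithm 1: rank by reward alone; skip sensors we cannot afford."""
    chosen, spent = [], 0.0
    for name, (reward, cost) in sorted(sensors.items(), key=lambda kv: -kv[1][0]):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

B, r, eps = 100.0, 10.0, 0.01
sensors = {'s1': (r, B)}
sensors.update({f's{i}': (r - eps, eps) for i in range(2, 12)})  # 10 cheap sensors

picked = greedy_ignore_cost(sensors, B)
# s1 has the single highest reward, gets picked first, and exhausts the budget;
# the 10 cheap sensors would jointly have earned almost 10x as much.
```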

### Bad Algorithm 2: optimization using benefit-cost ratio

**Algorithm**

- Greedily pick the sensor that maximizes the benefit-to-cost ratio until the budget runs out, i.e. always pick $s_{i+1} = \arg\max_{u \in V \setminus S_i} \frac{f(S_i \cup \{u\}) - f(S_i)}{c(u)}$

**This can fail arbitrarily badly!** Example:

- Given 2 sensors and a budget $B$
- $s_1$: reward $2\epsilon$, cost $\epsilon$
- $s_2$: reward $B$, cost $B$
- Then the benefit-cost ratios for the first selection are $2$ and $1$, respectively
- This algorithm will pick $s_1$ first and then cannot afford $s_2$, resulting in an arbitrarily bad solution with reward $2\epsilon$ instead of the optimal solution with reward $B$, when $\epsilon \to 0$.
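The symmetric failure can be reproduced the same way (illustrative sketch; the static ratio ranking used here coincides with the marginal-gain version because rewards are additive):

```python
def greedy_benefit_cost(sensors, budget):
    """Bad Algorithm 2: rank by reward/cost ratio; skip unaffordable sensors."""
    chosen, spent = [], 0.0
    for name, (reward, cost) in sorted(sensors.items(),
                                       key=lambda kv: -kv[1][0] / kv[1][1]):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

B, eps = 100.0, 0.01
sensors = {'s1': (2 * eps, eps), 's2': (B, B)}  # ratios: 2 vs 1

picked = greedy_benefit_cost(sensors, B)
# s1 wins the ratio comparison; after paying its tiny cost the remaining
# budget is B - eps < B, so s2 is unaffordable: reward 2*eps instead of B.
```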

### Solution: CELF (Cost-Effective Lazy Forward-selection)

**CELF** is a two-pass greedy algorithm [Leskovec et al. 2007]:

- Get solution $S'$ using unit-cost greedy (Bad Algorithm 1)
- Get solution $S''$ using benefit-cost greedy (Bad Algorithm 2)
- Final solution: $S = \arg\max\left(f(S'), f(S'')\right)$

**Approximation Guarantee**

- CELF achieves a $\frac{1}{2}\left(1 - \frac{1}{e}\right)$ factor approximation.

CELF also uses a lazy evaluation of marginal gains (see below) to speed up Hill Climbing.
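For additive rewards, CELF's outer structure is just the two greedy passes plus a comparison. A minimal sketch (illustrative; `sensors` maps each sensor name to a `(reward, cost)` pair):

```python
def celf(sensors, budget):
    """CELF outer loop for additive rewards: run both greedy passes and
    return whichever solution earns more."""
    def greedy(key):
        chosen, spent = [], 0.0
        for name, (reward, cost) in sorted(sensors.items(), key=key):
            if spent + cost <= budget:
                chosen.append(name)
                spent += cost
        return chosen

    unit_cost = greedy(lambda kv: -kv[1][0])                 # ignore costs
    benefit_cost = greedy(lambda kv: -kv[1][0] / kv[1][1])   # reward/cost ratio
    total = lambda S: sum(sensors[s][0] for s in S)
    return max(unit_cost, benefit_cost, key=total)
```

On the Bad Algorithm 2 example, the unit-cost pass picks $s_2$ (reward $B$) while the ratio pass picks $s_1$ (reward $2\epsilon$), and CELF keeps the former; on the Bad Algorithm 1 example the roles reverse. Keeping the better of the two passes is what lets the combination escape both failure modes.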

## Lazy Hill Climbing: Speedup Hill Climbing

### Intuition

- In Hill Climbing, in round $i + 1$, we have already picked $S_i = \{s_1, \ldots, s_i\}$ sensors. Now, pick $s_{i+1} = \arg\max_u \left(f(S_i \cup \{u\}) - f(S_i)\right)$.
- By submodularity, $f(S_i \cup \{u\}) - f(S_i) \geq f(S_j \cup \{u\}) - f(S_j)$ for $i < j$.
- Let $\delta_i(u) = f(S_i \cup \{u\}) - f(S_i)$ and $\delta_j(u) = f(S_j \cup \{u\}) - f(S_j)$ be the marginal gains. Then, we can use $\delta_i(u)$ as an upper bound on $\delta_j(u)$ (for $j > i$).

### Lazy Hill Climbing Algorithm

- Keep an ordered list of the marginal benefits $\delta_i(u)$ from the previous iteration
- Re-evaluate $\delta_i(u)$ only for the top nodes
- Reorder and prune from the top nodes

The following figure shows the process:

*(A) Evaluate the marginal gain $\delta$ of every node and pick the node with the largest one. (B) Reorder the marginal gains in decreasing order. (C) Re-evaluate the $\delta$s in order and pick the best possible node, using the previous $\delta$s as upper bounds. (D) Reorder and repeat.*

Note: the worst case of Lazy Hill Climbing has the same time complexity as normal Hill Climbing. However, it is on average much faster in practice.
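A common way to implement the lazy re-evaluation (an illustrative sketch, not the original authors' code) is a max-heap of possibly stale marginal gains, each stamped with the round in which it was computed. By submodularity a stale gain is an upper bound, so once the entry on top of the heap is fresh for the current round, it must be the true maximizer:

```python
import heapq

def lazy_hill_climbing(f, V, k):
    """Lazy greedy: re-evaluate only the node currently at the top of the heap."""
    S = set()
    # Max-heap entries: (-marginal_gain, node, round_when_gain_was_computed).
    heap = [(-(f({u}) - f(set())), u, 1) for u in V]
    heapq.heapify(heap)
    for round_no in range(1, k + 1):
        while heap:
            neg_gain, u, computed_in = heapq.heappop(heap)
            if computed_in == round_no:
                S.add(u)  # gain is fresh w.r.t. the current S: u is optimal
                break
            gain = f(S | {u}) - f(S)  # stale upper bound: refresh it
            heapq.heappush(heap, (-gain, u, round_no))
    return S
```

The round stamp is what makes a popped entry trustworthy: if its gain was computed against the current $S$, every other entry in the heap is bounded above by its own (stale or fresh) key and thus cannot beat it.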

## Data-Dependent Bound on the Solution Quality

### Introduction

- Value of the bound depends on the input data
- On “easy data”, Hill Climbing may do better than the bound for submodular functions

### Data-Dependent Bound

Suppose $S$ is some solution to $\max f(S)$ subject to $|S| \leq k$, and $f$ is monotone and submodular.

- Let $OPT = \{t_1, \ldots, t_k\}$ be the optimal solution
- For each $u$, let $\delta(u) = f(S \cup \{u\}) - f(S)$
- Order the nodes so that $\delta(1) \geq \delta(2) \geq \ldots$
- Then, the **data-dependent bound** is

$$f(OPT) \leq f(S) + \sum_{i=1}^{k} \delta(i)$$

Proof:

$$f(OPT) \leq f(OPT \cup S) \leq f(S) + \sum_{i=1}^{k} \delta(t_i) \leq f(S) + \sum_{i=1}^{k} \delta(i)$$

For the first inequality, see the lemma in 3.4.1 of this handout. For the last inequality: instead of taking the actual elements $t_i$ of benefit $\delta(t_i)$, we take the $k$ best possible elements $\delta(i)$, because we do not know $OPT$.

Note:

- This bound holds for the solution $S$ (subject to $|S| \leq k$) of any algorithm, as long as the objective function is monotone and submodular.
- The bound is data-dependent, and for some inputs it can be very “loose” (worse than the $(1 - \frac{1}{e})$ bound)
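Computing the bound is cheap once the marginal gains are in hand. An illustrative sketch (assuming `f` is supplied as a Python set function):

```python
def data_dependent_bound(f, V, S, k):
    """Upper bound on f(OPT): f(S) plus the k largest marginal gains w.r.t. S."""
    gains = sorted((f(S | {u}) - f(S) for u in V if u not in S), reverse=True)
    return f(S) + sum(gains[:k])
```

For a coverage objective with `sets = {'x': {1, 2}, 'y': {2, 3}, 'z': {3, 4}}`, the (suboptimal) solution `S = {'x'}`, and `k = 2`, the bound evaluates to 5, while the best two sets cover only 4 elements: the bound holds but is loose, as the note above warns.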