Outbreak Detection in Networks

Introduction

The general goal of outbreak detection in networks is that given a dynamic process spreading over a network, we want to select a set of nodes to detect the process efficiently. Outbreak detection in networks has many applications in real life. For example, where should we place sensors to quickly detect contaminations in a water distribution network? Which person should we follow on Twitter to avoid missing important stories?

The following figure shows the different effects of placing sensors at two different locations in a network:

[Figure: sensor_placement] (A) The given network. (B) An outbreak starts and spreads as shown. (C) Placing a sensor at the blue position saves more people and detects the outbreak earlier than placing a sensor at the green position, though it costs more.

Problem Setup

The outbreak detection problem is defined as follows:

  - Given: a graph $G(V, E)$ and data on how outbreaks spread over $G$ (for every outbreak $i$, the time $T(u, i)$ at which outbreak $i$ contaminates node $u$).
  - Goal: select a subset of nodes $S \subseteq V$ that maximizes the expected reward

$$\max_{S \subseteq V} f(S) = \sum_{i} p(i) \cdot f_i(S), \quad \text{subject to } cost(S) \leq B$$

where

  - $p(i)$ is the probability of outbreak $i$ occurring, and
  - $f_i(S)$ is the reward for detecting outbreak $i$ using the sensors $S$.

The Reward can be one of the following three:

  1. Minimize the time to detection.
  2. Maximize the number of detected propagations.
  3. Minimize the number of infected people.

The Cost is context-dependent. Examples are:

  - Reading the posts of a large, high-volume blog takes more time.
  - Placing a physical sensor at a remote location in a water network is more expensive.
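To make the setup concrete, here is a minimal sketch of evaluating the objective and the budget constraint. The function and dictionary names (`expected_reward`, `outbreak_prob`, `cost`) are illustrative placeholders, not part of the original formulation.

```python
from typing import Callable, Dict, Set

def expected_reward(S: Set[str],
                    outbreak_prob: Dict[int, float],
                    reward: Callable[[int, Set[str]], float]) -> float:
    """f(S) = sum_i p(i) * f_i(S): expected reward of placing sensors S."""
    return sum(p * reward(i, S) for i, p in outbreak_prob.items())

def within_budget(S: Set[str], cost: Dict[str, float], B: float) -> bool:
    """Feasibility check: cost(S) <= B."""
    return sum(cost[u] for u in S) <= B
```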

Outbreak Detection Formalization

Objective Function for Sensor Placements

Define the penalty $\pi_i(t)$ for detecting outbreak $i$ at time $t$, which can be one of the following:

  1. Time to Detection (DT): how long does it take to detect the outbreak? Penalty for detecting at time $t$: $\pi_i(t) = t$.
  2. Detection Likelihood (DL): do we detect the outbreak at all? Penalty: $\pi_i(t) = 0$ if the outbreak is detected and $\pi_i(\infty) = 1$ if it is not, i.e. a binary outcome.
  3. Population Affected (PA): how many people got infected during the outbreak? Penalty: $\pi_i(t) =$ the number of nodes infected by outbreak $i$ up to time $t$.

Notice: in all three cases detecting sooner does not hurt! Formally, this means that in all three cases $\pi_i(t)$ is monotonically nondecreasing in $t$.

The objective reward function $f_i(S)$ of a sensor placement $S$ is defined as the penalty reduction:

$$f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$$

where $T(S, i)$ is the time when the set of “sensors” $S$ detects the outbreak $i$.
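As an illustrative sketch (hypothetical data and names), $T(S, i)$ is simply the earliest infection time of any sensor in $S$, and $f_i(S)$ follows directly from the penalty function:

```python
import math
from typing import Callable, Dict, Set

def detection_time(S: Set[str], infection_time: Dict[str, float]) -> float:
    """T(S, i): earliest time any sensor in S is reached by outbreak i (inf if never)."""
    return min((infection_time[u] for u in S if u in infection_time), default=math.inf)

def penalty_reduction(S: Set[str],
                      infection_time: Dict[str, float],
                      penalty: Callable[[float], float]) -> float:
    """f_i(S) = pi_i(inf) - pi_i(T(S, i)) for a nondecreasing penalty pi_i(t)."""
    return penalty(math.inf) - penalty(detection_time(S, infection_time))

# Example penalty: detection likelihood (DL): 0 if detected, 1 if never detected.
dl_penalty = lambda t: 1.0 if math.isinf(t) else 0.0

# Hypothetical outbreak trace: node -> time at which it gets infected.
trace = {"a": 1.0, "b": 2.5, "c": 4.0}
print(penalty_reduction({"b", "c"}, trace, dl_penalty))  # 1.0: the outbreak is detected
```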

Claim 1: $f(S)$ is monotone (for the definition of monotone, see Influence Maximization)

Firstly, if we do not place any sensors, we do not reduce the penalty. Therefore, $f_i(\emptyset) = 0$ and $f(\emptyset) = 0$.

Secondly, for all $A \subseteq B \subseteq V$ ($V$ is the set of all nodes in $G$), $T(A, i) \geq T(B, i)$, and

$$f_i(A) = \pi_i(\infty) - \pi_i(T(A, i)) \leq \pi_i(\infty) - \pi_i(T(B, i)) = f_i(B)$$

Because $\pi_i(t)$ is monotonically nondecreasing in $t$ (see the penalty definitions above), $\pi_i(T(A, i)) \geq \pi_i(T(B, i))$, which gives $f_i(A) \leq f_i(B)$. Therefore, $f_i(S)$ is nondecreasing. It is obvious that $f(S)$ is also nondecreasing, since $f(S) = \sum_i p(i) \cdot f_i(S)$ is a nonnegative linear combination of nondecreasing functions.

Hence, $f(S)$ is monotone.

Claim 2: $f(S)$ is submodular (for the definition of submodular, see Influence Maximization)

We need to prove that, for all $A \subseteq B \subseteq V$ and $x \in V \setminus B$:

$$f_i(A \cup \{x\}) - f_i(A) \geq f_i(B \cup \{x\}) - f_i(B)$$

There are three cases for the time $T(x, i)$ at which sensor $x$ detects the outbreak $i$:

  1. $T(x, i) \geq T(A, i) \geq T(B, i)$ ($x$ detects late): nobody benefits. That is, $f_i(A \cup \{x\}) = f_i(A)$ and $f_i(B \cup \{x\}) = f_i(B)$. Therefore, $f_i(A \cup \{x\}) - f_i(A) = 0 = f_i(B \cup \{x\}) - f_i(B)$.
  2. $T(A, i) > T(x, i) \geq T(B, i)$ ($x$ detects after $B$ but before $A$): $x$ only helps to improve the solution of $A$ but not of $B$. Therefore, $f_i(A \cup \{x\}) - f_i(A) \geq 0 = f_i(B \cup \{x\}) - f_i(B)$.
  3. $T(x, i) < T(B, i) \leq T(A, i)$ ($x$ detects early): $f_i(A \cup \{x\}) - f_i(A) = \left[\pi_i(\infty) - \pi_i(T(x, i))\right] - f_i(A) \geq \left[\pi_i(\infty) - \pi_i(T(x, i))\right] - f_i(B) = f_i(B \cup \{x\}) - f_i(B)$. The inequality is due to the nondecreasingness of $f_i(\cdot)$, i.e. $f_i(A) \leq f_i(B)$ (see Claim 1).

Therefore, $f_i(S)$ is submodular. Because $f(S) = \sum_i p(i) \cdot f_i(S)$ with $p(i) \geq 0$, $f(S)$ is also submodular. (Fact: a non-negative linear combination of submodular functions is a submodular function.)
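The case analysis can also be sanity-checked numerically. The following sketch, on a toy outbreak trace with made-up numbers, verifies the submodularity inequality for the penalty-reduction reward (using a capped time-to-detection penalty so that $\pi_i(\infty)$ is finite):

```python
import math
from itertools import combinations

# Toy outbreak trace: node -> time at which the outbreak reaches it (made-up numbers).
trace = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
PI_INF = 10.0                       # penalty if the outbreak is never detected
pi = lambda t: min(t, PI_INF)       # capped "time to detection" penalty

def f_i(S):
    """Penalty reduction f_i(S) = pi(inf) - pi(T(S, i))."""
    t_detect = min((trace[u] for u in S if u in trace), default=math.inf)
    return PI_INF - pi(t_detect)

nodes = sorted(trace)
subsets = [set(c) for r in range(len(nodes) + 1) for c in combinations(nodes, r)]
for A in subsets:
    for B in subsets:
        if not A <= B:
            continue
        for x in set(nodes) - B:
            assert f_i(A | {x}) - f_i(A) >= f_i(B | {x}) - f_i(B)
print("Submodularity inequality holds on this toy instance.")
```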

We know that the Hill Climbing algorithm works for optimization problems with nondecreasing submodular objectives. However, it does not work well for this problem:

  1. Hill Climbing only works when every sensor has the same (unit) cost, whereas here each sensor $s$ has its own cost $c(s)$.
  2. Hill Climbing is slow: at each iteration we need to re-evaluate the marginal gains of all nodes, so placing $k$ sensors takes $O(|V| \cdot k)$ function evaluations.

Hence, we need a new fast algorithm that can handle cost constraints.

CELF: Algorithm for Optimizing Submodular Functions Under Cost Constraints

Bad Algorithm 1: Hill Climbing that ignores the cost

Algorithm: ignore the sensor costs and repeatedly add the sensor with the highest marginal gain $f(S \cup \{s\}) - f(S)$ until the budget $B$ is exhausted.

This can fail arbitrarily badly! Example: suppose there are $n$ sensors and budget $B$; sensor $s_1$ has reward $r$ and cost $B$, while sensors $s_2, \dots, s_n$ each have reward $r - \epsilon$ and cost $\epsilon$. Ignoring cost, Hill Climbing always prefers $s_1$, spending the whole budget for reward $r$, whereas the optimal solution buys the cheap sensors for a total reward of roughly $\frac{B}{\epsilon}(r - \epsilon)$. As $\epsilon \to 0$, the gap becomes arbitrarily large.

Bad Algorithm 2: optimization using benefit-cost ratio

Algorithm: greedily add the sensor $s$ that maximizes the benefit-cost ratio $\frac{f(S \cup \{s\}) - f(S)}{c(s)}$ until the budget is exhausted.

This can fail arbitrarily badly! Example: suppose the budget is $B$ and there are two sensors: $s_1$ with cost $\epsilon$ and benefit $2\epsilon$, and $s_2$ with cost $B$ and benefit $B$. The benefit-cost ratios are $2$ and $1$, so the ratio-based greedy picks $s_1$ first and can then no longer afford $s_2$, ending with total benefit $2\epsilon$ instead of the optimal $B$. As $\epsilon \to 0$, the solution becomes arbitrarily bad.

Solution: CELF (Cost-Effective Lazy Forward-selection)

CELF is a two-pass greedy algorithm [Leskovec et al. 2007]:

  1. Get a solution $S'$ using the unit-cost greedy (ignore costs, as in Bad Algorithm 1).
  2. Get a solution $S''$ using the benefit-cost greedy (as in Bad Algorithm 2).
  3. Return the final solution $S = \arg\max \left( f(S'), f(S'') \right)$.
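The following is a minimal sketch of the two passes for a generic monotone submodular objective `f` with per-node costs; the helper names and signatures are illustrative, not from the original paper, and the lazy evaluation described below is omitted for clarity.

```python
from typing import Callable, Dict, Set

def greedy(f: Callable[[Set[str]], float],
           cost: Dict[str, float],
           B: float,
           use_ratio: bool) -> Set[str]:
    """Greedy sensor selection under budget B.

    use_ratio=False: unit-cost greedy (largest marginal gain, as in Bad Algorithm 1).
    use_ratio=True : benefit-cost greedy (largest gain/cost, as in Bad Algorithm 2).
    """
    S: Set[str] = set()
    spent = 0.0
    while True:
        best, best_score = None, 0.0
        for u in set(cost) - S:
            if spent + cost[u] > B:
                continue  # cannot afford this sensor
            gain = f(S | {u}) - f(S)
            score = gain / cost[u] if use_ratio else gain
            if score > best_score:
                best, best_score = u, score
        if best is None:
            return S
        S.add(best)
        spent += cost[best]

def celf(f: Callable[[Set[str]], float], cost: Dict[str, float], B: float) -> Set[str]:
    """CELF: run both greedy variants and keep whichever solution scores higher."""
    s_unit = greedy(f, cost, B, use_ratio=False)
    s_ratio = greedy(f, cost, B, use_ratio=True)
    return max([s_unit, s_ratio], key=f)
```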

Approximation Guarantee

CELF achieves at least a $\frac{1}{2}\left(1 - \frac{1}{e}\right)$ fraction of the optimal solution value [Leskovec et al. 2007].

CELF also uses a lazy evaluation of the marginal gains $\delta_i$ (see below) to speed up Hill Climbing.

Lazy Hill Climbing: Speeding Up Hill Climbing

Intuition:

  1. In Hill Climbing, when going from round $i$ to round $i+1$, the marginal gains of most nodes barely change.
  2. By submodularity, a node's marginal gain can only shrink as the selected set grows, so a gain computed in an earlier round is an upper bound on its current gain.

Lazy Hill Climbing Algorithm:

  1. Keep an ordered list of the marginal gains $\delta_i$ from the previous iteration, largest first.
  2. In the current iteration, re-evaluate $\delta_i$ only for the node at the top of the list.
  3. If the re-evaluated $\delta_i$ stays at the top, pick that node; otherwise, re-insert the node at its new position in the list and repeat with the new top node.

The following figure shows the process.

[Figure: lazy_evaluation]

(A) Evaluate the marginal gains $\delta_i$ and pick the node with the largest $\delta_i$. (B) Reorder the marginal gains of the remaining sensors in decreasing order. (C) Re-evaluate the $\delta_i$s in order, using the previous $\delta_i$s as upper bounds, and pick the best possible node. (D) Reorder and repeat.

Note: the worst case of Lazy Hill Climbing has the same time complexity as normal Hill Climbing. However, it is on average much faster in practice.
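Below is a sketch of the lazy evaluation using a max-heap of stale marginal gains (unit-cost case, for simplicity; `f` is any monotone submodular set function and the names are illustrative):

```python
import heapq
from typing import Callable, Iterable, Set

def lazy_greedy(f: Callable[[Set[str]], float], nodes: Iterable[str], k: int) -> Set[str]:
    """Lazy Hill Climbing: only re-evaluate the current top candidate.

    Submodularity guarantees that a stale gain is an upper bound on the current
    gain, so if the freshly re-evaluated top node still beats the next stale
    bound, it must be the best choice for this round.
    """
    S: Set[str] = set()
    # Max-heap (via negated values) of (stale marginal gain, node), w.r.t. the empty set.
    heap = [(-(f({u}) - f(set())), u) for u in nodes]
    heapq.heapify(heap)
    while len(S) < k and heap:
        _, u = heapq.heappop(heap)
        fresh_gain = f(S | {u}) - f(S)                    # re-evaluate only this node
        next_bound = -heap[0][0] if heap else float("-inf")
        if fresh_gain >= next_bound:                      # still on top: safe to pick
            S.add(u)
        else:                                             # otherwise re-insert and retry
            heapq.heappush(heap, (-fresh_gain, u))
    return S
```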

Data-Dependent Bound on the Solution Quality

Introduction

  1. The $\left(1 - \frac{1}{e}\right)$ approximation guarantee is a worst-case bound that holds for any monotone submodular objective, no matter what the input data looks like.
  2. A data-dependent bound, in contrast, depends on the particular input; on "easy" data, Hill Climbing may do much better than the worst-case 63%.

Data-Dependent Bound

Suppose $S$ is some solution to the problem of maximizing $f(S)$ subject to $|S| \leq k$, where $f$ is monotone and submodular. Then:

  1. Let $OPT = \{t_1, \dots, t_k\}$ be the optimal solution.
  2. For each node $u$, define the marginal gain $\delta(u) = f(S \cup \{u\}) - f(S)$.
  3. Order the nodes so that $\delta(1) \geq \delta(2) \geq \dots$
  4. Then: $f(OPT) \leq f(S) + \sum_{i=1}^{k} \delta(i)$.

Proof:

$$f(OPT) \leq f(S) + \sum_{i=1}^{k} \left[ f(S \cup \{t_i\}) - f(S) \right] = f(S) + \sum_{i=1}^{k} \delta(t_i) \leq f(S) + \sum_{i=1}^{k} \delta(i)$$

(For the first inequality, see the lemma in 3.4.1 of this handout. For the last inequality: instead of taking the elements $t_i \in OPT$ with benefits $\delta(t_i)$, we take the $k$ best possible elements $\delta(1), \dots, \delta(k)$, because we do not know $OPT$.)

Note:

  1. This bound holds for the given input data; for some inputs it can be very "loose", i.e. worse than the worst-case $\left(1 - \frac{1}{e}\right)$ bound.
  2. On typical, realistic inputs it often gives a much tighter estimate of how far the greedy solution is from optimal.
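Computing this bound is cheap once a solution $S$ is available: evaluate the marginal gain of every remaining node, sum the $k$ largest, and add $f(S)$. A minimal sketch (illustrative names, generic objective `f`):

```python
from typing import Callable, Iterable, Set

def data_dependent_bound(f: Callable[[Set[str]], float],
                         S: Set[str],
                         nodes: Iterable[str],
                         k: int) -> float:
    """Upper bound on f(OPT): f(S) plus the k largest marginal gains w.r.t. S."""
    gains = sorted((f(S | {u}) - f(S) for u in set(nodes) - S), reverse=True)
    return f(S) + sum(gains[:k])

# The ratio f(S) / data_dependent_bound(...) then lower-bounds how close S is to optimal.
```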