Influence Maximization

Motivation

Identification of influential nodes in a network has important practical uses. A good example is “viral marketing”, a strategy that uses existing social networks to spread and promote a product. A well-engineered viral marking compaign will identify the most influential customers, convince them to adopt and endorse the product, and then spread the product in the social network like a virus.

The key question is how to find the most influential set of nodes? To answer this question, we will first look at two classical cascade models:

Then, we will develop a method to find the most influential node set in the Independent Cascade Model.

Linear Threshold Model

In the Linear Threshold Model, we have the following setup:

The following figure demonstrates the process:

linear_threshold_model_demo

(A) node V is activated and influences W and U by 0.5 and 0.2, respectively; (B) W becomes activated and influences X and U by 0.5 and 0.3, respectively; (C) U becomes activated and influences X and Y by 0.1 and 0.2, respectively; (D) X becomes activated and influences Y by 0.2; no more nodes can be activated; process stops.

Independent Cascade Model

In this model, we model the influences (activation) of nodes based on probabilities in a directed graph:

Note:

Influential Maximization (of the Independent Cascade Model)

Definitions

influence_set

Red-colored nodes a and b are active. The two green areas enclose the nodes activated by a and b respectively, i.e. and .

Note:

Problem Setup

The influential maximization problem is then an optimization problem:

This problem is NP-hard [Kempe et al. 2003]. However, there is a greedy approximation algorithm–Hill Climbing that gives a solution with the following approximation guarantee:

where is the globally optimal solution.

Hill Climbing

Algorithm: at each step , activate and pick the node that has the largest marginal gain :

Claim: Hill Climbing produces a solution that has the approximation guarantee .

Proof of the Approximation Guarantee of Hill Climbing

Definition of Monotone: if and for all , then is monotone.

Definition of Submodular: if for any node and any , then is submodular.

Theorem [Nemhauser et al. 1978]:also see this handout if is monotone and submodular, then the obtained by greedily adding elements that maximize marginal gains satisfies

Given this theorem, we need to prove that the largest expected cascade size function is monotone and submodular.

It is clear that the function is monotone based on the definition of If no nodes are active, then the influence is 0. That is . Because activating more nodes will never hurt the influence, if ., and we only need to prove is submodular.

Fact 1 of Submodular Functions: is submodular, where is a set. Intuitively, the more sets you already have, the less new “area”, a newly added set will provide.

Fact 2 of Submodular Functions: if are submodular and , then is also submodular. That is a non-negative linear combination of submodular functions is a submodular function.

Proof that is Submodular: we run many simulations on graph G (see sidenote 1). For the simulated world , the node has an activation set , then is the size of the cascades of for world . Based on Fact 1, is submodular. The expected influence set size is also submodular, due to Fact 2. QED.

Evaluation of and Approximation Guarantee of Hill Climbing In Practice: how to evaluate is still an open question. The estimation achieved by simulating a number of possible worlds is a good enough evaluation [Kempe et al. 2003]:

Speed-up Hill Climbing by Sketch-Based Algorithms

Time complexity of Hill Climbing

To find the node that (see the algorithm above):

We will do this (number of nodes to be selected) times. Therefore, the time complexity of Hill Climbing is , which is slow. We can use sketches [Cohen et al. 2014] to speed up the evaluation of by reducing the evaluation time from to Besides sketches, there are other proposed approaches for efficiently evaluating the influence function: approximation by hypergraphs [Borgs et al. 2012], approximating Riemann sum [Lucier et al. 2015], sparsification of influence networks [Mathioudakis et al. 2011], and heuristics, such as degree discount [Chen et al. 2009]..

Single Reachability Sketches

Intuition: if can reach a large number of nodes, then its rank is likely to be small. Hence, the rank of node can be used to estimate the influence of node in .

However, influence estimation based on Single Reachability Sketches (i.e. single simulation of ) is inaccurate. To make a more accurate estimate, we need to build sketches based on many simulationsThis is similar to take an average of in sidenote 1, but in this case, it is achieved by using Combined Reachability Sketches., which leads to the Combined Reachability Sketches.

Combined Reachability Sketches

In Combined Reachability Sketches, we simulate several possible worlds and keep the smallest values among the nodes that can reach in all the possible worlds.

Note: using Combined Reachability Sketches does not provide an approximation guarantee on the true expected influence but an approximation guarantee with respect to the possible worlds considered.