Epidemiological Modelling of Peer-to-Peer Viruses and Pollution, Hacking and IT E-Book Dump Release
[ Pobierz całość w formacie PDF ]
Epidemiological Modelling of Peer-to-Peer Viruses
and Pollution
Richard Thommes and Mark Coates
Department of Electrical and Computer Engineering
McGill University
3480 University St
Montreal, QC, Canada H3A 2A7
Email:
{
rthomm,coates
}
@tsp.ece.mcgill.ca
networks unusable.
Abstract
— The popularity of peer-to-peer (P2P) networks
makes them an attractive target to the creators of viruses and
other malicious code. Recently a number of viruses designed
specically to spread via P2P networks have emerged. Pollution
has also become increasingly prevalent as copyright holders inject
multiple decoy versions in order to impede item distribution. In
this paper we derive deterministic epidemiological models for
the propagation of a P2P virus through a P2P network and
the dissemination of pollution. We report on discrete simulations
that provide some verication that the models remain sufciently
accurate despite variations in individual peer conduct to provide
insight into the behaviour of the system. The paper examines
the steady-state behaviour and illustrates how the models may
be used to estimate in a computationally efcient manner how
effective object reputation schemes will be in mitigating the
impact of viruses and preventing the spread of pollution.
In this paper we examine the behaviour of viruses and
pollution in P2P networks. We adopt an epidemiological ap-
proach, developing dynamic models to describe the evolution
of infection/pollution. We consider the stochastic nature of
the system during our development of the models, but our
models are deterministic and focus on the expected behaviour
of the system. We illustrate that these deterministic models are
sufciently accurate to capture the behaviour of P2P networks,
by comparison with more realistic simulations that model
individual peers.
Our initial purpose is to model the impact of malicious code
on a P2P network, but a primary motivation is to examine how
effective the introduction of mitigation techniques might be.
In particular, we focus on
object reputation
schemes (such as
Credence [6]) and methods that increase the rate of elimination
of infected les. Our model provides an analytical method
for determining (at least approximately) how widespread the
adoption of such schemes must be, and how effective they
must be, in order that specic targets of residual pollution or
infection be achieved. We validate these specications through
more accurate simulation of the networks.
I. I
NTRODUCTION
Peer-to-peer (P2P) networks have become increasingly vul-
nerable to malicious behaviour, including the dissemination of
polluted
versions of les and the release of P2P viruses. Early
P2P networks such as Napster focussed exclusively on media
les, so propagation of viruses was difcult to achieve [1].
Contemporary P2P networks such as Kazaa / Fastrack [2]
and eDonkey2000 [3] can be used to disseminate executable
les and are hence much more susceptible, particularly as the
mainstream adoption of P2P le exchange—the eDonkey2000
network alone typically has over 2 million users connected at
any given time [4]—means that a signicant fraction of users
lack the technical knowledge to detect suspicious les or scan
for viruses.
The phenomenon of pollution, the presence of corrupted
(or “bad”) versions of items (songs, movies or multimedia
les) in P2P networks, has become increasingly prevalent.
Some of these versions are made available by accident, as
users make errors in le generation. But the dominant cause is
deliberate dissemination of decoy les, termed
item poisoning
in [5], a technological mechanism employed by copyright
holders and their agents to impede the distribution of content.
These decoy les have names and metadata matching those
of the genuine item, but contain corrupted, unreadable or
inappropriate data. Whether accidental or deliberate, pollution
has rendered a substantial portion of the les on popular P2P
The paper is structured as follows. In the remainder of the
introduction, we highlight the salient features of P2P networks,
viruses and pollution, and discuss related work. Section II
presents a model for the expected evolution of a virus in the
system. In Section III, we analyze the steady-state behaviour of
our P2P virus model. Section IV presents an epidemiological
model for the proliferation of pollution. Section V examines
the impact of object reputation schemes. Section VI reports
on an empirical study of the e-Donkey network, which we
conducted to identify suitable parameters for the examination
of our models. Section VII reports on discrete-time simulations
of the P2P network, which provide a validation that the
deterministic models capture the primary characteristics of
system evolution despite ignoring the variability in behaviour
of individual peers. Finally, Section VIII draws conclusions
based on our analysis and results.
A. Peer-to-peer networks, viruses and pollution
This section highlights the key features shared by pop-
ular P2P Networks, including Kazaa, eDonkey2000, and
Gnutella [7]. Every peer connected to the network has a
shared
folder
containing all the les the user wishes to make publicly
available for download by others on the network. When a
user wants to download a le, he begins by sending out a
search request. In response he receives a list of les matching
the search criteria. The specic manner in which this list is
generated varies among the various P2P networks, but in all
cases the query response is the result of the examination of
the shared folders of a subset of all peers connected to the
network. Once the user elects to download one of the les from
the list, his client attempts to set up a connection to a peer
sharing the le and begins receiving the le. Depending on
the specic network, the client may attempt to simultaneously
download different parts of the le from a number of peers
in order to expedite the operation. P2P clients typically save
new downloaded les in the shared folder – making them
immediately available to other users.
A number of worms and viruses that exploit P2P networks
have already surfaced. The majority of these behave in a
similar fashion. Specically, when a user downloads a le
containing the virus and executes it, a number of new les
containing the virus are created and placed in the client’s
shared directory. Some types of viruses, including Achar [8]
and Gotorm [9], generate a xed list of lenames when
executed. More advanced viruses, such as Bare [10] and
Krepper [11], randomly pick the list of lenames from a large
pool of candidates.
Pollution is a more widespread phenomenon, as indicated by
the empirical study performed in [12]. The study indicated that
the number of versions of relatively popular items is generally
substantial (on the order of tens or hundreds). It was also
observed that the pollution level (the fraction of bad versions)
for a specic item remained approximately constant over time.
and e-mail viruses, there are sufcient differences in their
spreading mechanics to necessitate the development of a new
model. The dynamic pollution model developed in [5] is
closely related to our epidemiological pollution model, and
produces similar behaviour. Phrasing the model in an epi-
demiological framework provides an alternative understanding
of system behaviour. The deterministic models are reason-
ably accurate even with substantial variation in individual
peer behaviour, and we illustrate how they can be used to
estimate in a computationally efcient manner the impact of
an object reputation scheme in mitigating P2P viruses and
pollution. Conversely, the models can be used to determine
how widespread the usage of a reputation scheme must be and
how much it must dampen the probability of downloading an
infected or polluted le in order to achieve a target level of
pollution/infection.
II. P2P V
IRUS
M
ODEL
The intent of our model is to predict the expected behaviour
of a virus which spreads through a P2P network in the form of
malicious code embedded in executable les shared by peers.
We make the simplifying assumption that all users download
les to their shared folder. We are not concerned with the
transfer of media les which cannot contain malicious code,
and do not model them. Note that we use the term
user
in
this paper to refer to a person using a P2P
client
program.
The term
peer
is used to collectively refer to a P2P client and
the user directing its behaviour.
This model classies all peers as falling into one of three
classes:
Susceptible, Exposed
, or
Infected
:
Susceptible
– Peers that are not sharing any infected les,
but are at risk of downloading infected les. The number of
peers in this category at time
t
is denoted by
S(t)
.
Exposed
– Peers that have downloaded one or more
infected les, but have not executed them. The number of
peers in this category at time
t
is denoted by
E(t)
. The
Exposed category is included in the model to allow for a
delay between download of an infected le and execution.
B. Related Work
The advent of mathematical Epidemiology – the eld of
biology which models how diseases spread in a population
– is generally credited to McKendrick and his seminal 1926
paper [13]. Previous work in applying epidemiology to model-
ing how computer viruses and other malware spreads between
machines dates back to the late 1980s/early 1990s [14], [15].
More recently, several authors have utilized epidemiological
models to study the spread of worms and e-mail viruses in
the Internet [16]–[20].
There have been a number of recent papers which model
le propagation in P2P networks [21]–[24]. Dumitriu et al. [5]
model the spread of polluted les in P2P networks, and Liang
et al. report on an empirical study of pollution in P2P networks
in [12]. The behaviour of object reputation mechanisms has
been discussed in [6].
Contribution
: We believe that our paper is the rst to
develop a epidemiological model for peer-to-peer viruses.
Although these viruses share similarities with Internet worms
Infected
– Peers that have executed an infected le. Upon
execution, a total of of
c
infected les reside in the peer’s
shared folder. The number of peers in this category at time
t
is denoted by
I(t)
.
An Infected client may be detected by the user, who will
then proceed to remove all the infected les, thereby returning
the state of the peer to Susceptible. At all times, every one of
the
N
peers making up the network falls into one of the three
categories. Thus, for all values of
t
,
N = S(t) + E(t) + I(t)
.
We assume that the total number of uninfected les in
the network is xed at
M
. The total number of infected
les at time
t
is given by
K(t)
. The expected proportion of
infected les in the network,
q(t)
, is therefore
q(t) =
K(t)
K(t)+M
.
Event Variables Affected
File downloaded
q(t)
,
S(t)
,
E(t)
File executed
q(t)
,
E(t)
,
I(t)
Peer recovers
q(t)
,
I(t)
,
R(t)
TABLE I
P2P V
IRUS MODEL VARIABLES THAT ARE POTENTIALLY AFFECTED BY
EACH POSSIBLE EVENT IN THE NETWORK
.
Exposed peer executes an infected le, the number of Infected
peers increases by one. Since les are executed in order of
download, the le executed by an Exposed peer will always be
the infected le which it had downloaded to become Exposed.
This occurs at a rate of
¸
E
E(t)
. Therefore,
dI(t)
dt
=−¸
R
I(t) + ¸
E
E(t)
(1)
Rate at which the number of Exposed peers changes
: The
rate at which the number of Exposed peers decreases due to
infection is given by the negative of the second term in (1). The
rate at which previously Susceptible peers become Exposed is
dependent on the aggregate rate at which they download les,
¸
S
S(t)
, multiplied by the probability that a downloaded le
is infected,
h(t)
. The overall rate is therefore:
When a user downloads a le, we assume the probability of
choosing an infected le will be dependent on the prevalence
of infected les in the network. The probability will vary
to some degree for different peers, according to whether the
peer has updated virus-detection software or is aware of the
common characteristics of virus les (such les are often
much smaller than genuine versions of the item). In our
model, we are interested in the average probability of choosing
an infected le, and we denote this probability by
h(t)
. In
Section III, where we examine steady-state behaviour, we set
h(t) = ®q(t)
, for some constant
®
, to reect the fact that the
probability is closely tied to virus prevalence and to simplify
our analysis.
There are three distinct events that may occur in the
network which affect one or more of the time-varying
variables described above. These events include a peer
downloading a le from another, a peer executing a shared
le, and an Infected peer recovering. Although individual
peers conduct these activities at (potentially very) different
rates, we develop our model based on average behaviour. Our
simulation results in Section VII indicate that this modelling
choice does not produce substantially erroneous behaviour.
The average rates at which each of these events occurs are
governed by three parameters:
dE(t)
dt
=−¸
E
E(t) + ¸
S
S(t)h(t)
(2)
Rate at which the number of Susceptible peers changes
:
Since
N
is xed, it always holds that
dS(t)
dt
+
dE(t)
dt
+
dI(t)
dt
= 0
.
dS(t)
dt
Therefore,
is the negative sum of (1) and
(2):
dS(t)
dt
=−¸
S
S(t)h(t) + ¸
R
I(t)
(3)
Rate at which the number of infected les in the network
changes
: There are three events which result in a change in
the number of infected les in the network: a peer downloads
an infected le, an Exposed peer becomes Infected, and an
Infected peer recovers. We assume that all downloaded les
are executed, and that a peer does not download any additional
les prior to executing the most recently downloaded le.
Peers cannot share more than one copy of a le with the
same name. If the number of unique infected lenames is
limited to
c
, only Susceptible peers can download infected
les. Exposed peers do not download any additional les
before becoming Infected, and Infected peers are sharing all
c
possible infected les. Thus, the rate of change due to
downloads is
S(t)¸
S
h(t)
.
An Exposed peer always has one infected le before be-
coming Infected, meaning in all cases
c−1
new infected les
are created when an Exposed peer becomes Infected. The rate
of change is thus
E(t)¸
S
(c−1)
.
An Infected peer will always share
c
les, so a recovery
results in a reduction of
c
infected les. The rate is therefore
−I(t)¸
R
c
. The overall rate of change of
K
is therefore:
dK(t)
dt
¸
S
: Average rate, in les per minute, at which each peer
downloads new les (this includes time spent searching and
setting up the connection to another peer).
¸
E
: Average rate, in les per minute, at which each peer
executes shared les. We assume that a peer executes les in
the order in which they are downloaded.
¸
R
: Average rate, in “recoveries per minute”, at which
Infected peers recover. A recovery occurs when all infected
les are removed, returning the peer state to Susceptible.
= S(t)¸
S
h(t) + E(t)¸
E
(c−1)−I(t)¸
R
c
(4)
A. Model Equations
Table I summarizes which time-varying variables are af-
fected by each of the three events that may occur in the
network. The state progression for all peers in our model is
S!E!I!S...
. We now derive the differential equations
that govern the evolution of our P2P model.
Rate at which the number of Infected peers changes
: When
an Infected peer recovers, the number of Infected peers de-
creases by one. Recoveries occur at rate
¸
R
I(t)
. When an
We note that if the names of generated les are chosen from
a pool of names much larger than
c
, Infected peers can con-
tinue to download infected les and the above equation does
not hold. The model and analysis in this case becomes more
involved. See [25] for a discussion on this and other variations
of the model, including cases where not all downloaded les
are executed and where multiple downloads are possible prior
to execution.
h(t) = f(q(t))
. Dening
E
,
I
,
S
, as the steady-state values of,
respectively,
E(t)
,
I(t)
, and
S(t)
, Equation (1) implies that:
B. Model Extensions
1) Modeling On-line/Off-line Behaviour:
In a real P2P net-
work, individual peers are only on-line for limited durations.
In order to capture this behavior, we present an extension
of our model that includes both on-line and off-line users.
Each of the three variables specifying how many peers are in
each category –
S, E, I
– is partitioned into two variables to
account for how many peers in the category are on and off-
line. So, for instance,
I(t) = I
N
(t) + I
F
(t)
, where
I
N
(t)
is
the number of Infected peers on-line, and
I
F
(t)
is the number
of Infected peers ofine. Peers that are off-line go on-line at
a certain rate
¸
N
, and on-line peers go off-line at rate
¸
F
.
The differential equation governing the change in the number
of on-line Infected peers at time
t
is:
I = E
¸
E
¸
R
(6)
If we dene
¿
and
µ
as, respectively, the expected number
of infected les each Exposed and Infected peer is sharing in
steady-state, then
q
, the proportion of infected les in steady-
state may be expressed as:
E¿ + Iµ
M + E¿ + Iµ
q =
(7)
Substituting (6) into (7) provides:
dI
N
(t)
dt
E(¿ ¸
R
+ µ¸
E
)
M¸
R
+
= I
F
(t)¸
N
−I
N
(t)¸
F
(5)
q =
(8)
E(¿ ¸
R
+ µ¸
E
)
The equations governing the rates of change in
S
N
(t)
and
E
N
(t)
are analogous. We assume here that peers go on and
off-line at the same rate regardless of their state. It would also
be simple to expand the model to include different rates for
each state.
To complete the specication of the extended model, all
the previously dened differential equations are changed as
follows: every instance of
S(t)
,
E(t)
, and
I(t)
is replaced,
respectively, by
S
N
(t)
,
E
N
(t)
, and
I
N
(t)
.
If
f(q) > 0
, equation (2) implies that, in steady state:
¸
E
¸
S
f(q)
S =
E
(9)
S = N−I−E
, equation (6) can be utilized to express
Since
N
as:
S = N−E(1 +
¸
E
¸
R
)
(10)
2) Modeling Peers that Remain Infected:
One can argue
that a certain proportion of P2P users, when their client
becomes Infected, will never detect that this has occurred and
not take any action to remedy this problem. In order to include
this behaviour in our model, we classify all peers as “aware”
or “oblivious”. Aware peers behave as those in our basic model
described in II-A, while oblivious peers progress
S!I
and
then remain Infected. The number of peers in each group is
xed:
N = N
A
+N
O
where
N
A
is the number of aware peers,
and
N
O
is the number of oblivious peers.
As in Section II-B.1, the number of peers falling into each
of the four categories at time
t
is partitioned into two groups;
in this case the number of aware users in category
X
at time
t
where
X2{S, E, I}
, is denoted by
X
A
(t)
and the number
of oblivious users in each category is denoted by
X
O
(t)
. The
behaviour of aware users is determined by equations (1), (2),
and (3), with
X
A
(t)
replacing
X(t)
for all
X2{S, E, I}
.
Oblivious users are governed by (1), (2), and (3), with
X
0
(t)
replacing
X(t)
, and
¸
R
set to zero (reecting the fact that
oblivious peers never recover). Finally,
dK(t)
If
h(t)
is proportional to
q(t)
,
h(t) = ®q(t)
, we can obtain
a closed-form expression for
E
by substituting (8) into (9),
equating with (10), and solving for
E
:
¸
R
®(N¸
S
(µ¸
E
+ ¿ ¸
R
)−M¸
E
¸
R
)
(¿ ¸
R
+ µ¸
E
)(¸
S
®(¸
R
+ ¸
E
) + ¸
E
¸
R
)
E =
; q > 0
(11)
I
follows trivially from (11) and (6):
The expression for
¸
E
®(N¸
S
(µ¸
E
+ ¿ ¸
R
)−M¸
E
¸
R
)
(¿ ¸
R
+ µ¸
E
)(¸
S
®(¸
R
+ ¸
E
) + ¸
E
¸
R
)
I =
; q > 0
(12)
If
q = 0
, it follows from (7) that
E = I = 0
. It is of interest
to consider Equation (12) as it approaches 0. In the limiting
case, approached from above, we have the equality
N¸
S
(µ¸
E
+ ¿ ¸
R
) = M¸
E
¸
R
(13)
Since we assume that all downloaded les are eventually
executed, it follows that it is reasonable to equate the rates
of download and execution,
¸
E
= ¸
S
. Under this assump-
tion, (13) provides the following minimum average recover
rate,
¸
mi
R
in order for all infected les to eventually be
removed from a P2P network:
dt
is governed by
a modied version of (4), with
S(t)
replaced by
S
A
(t)+S
O
(t)
,
E(t)
replaced by
E
A
(t) + E
O
(t)
, and
I(t)
replaced by
I
A
(t)
.
III. A
NALYSIS
- S
TABILITY
R
ESULTS
If the P2P network reaches a steady-state equilibrium by
some time
t = T
, then
Nµ¸
E
M−N¿
¸
min
R
=
; M > N¿
(14)
dE(T )
dt
dI(T )
dt
=
dS(T )
dt
= 0
. In
this section, we assume that the probability of downloading an
infected le is a function of the proportion of infected les, i.e.,
=
This equation indicates that, if
h(t) = ®q(t)
, then
¸
min
R
is
a linearly increasing function of
¸
E
.
from the susceptible to recovered state by downloading a
good version, it shares the le with probability
p
sg
. When
a peer transitions from the susceptible to infected state by
downloading a bad le, it shares the le with probability
p
sb
.
When a peer transitions from the infected to susceptible state
or recovered state, it removes the polluted le with probability
p
db
. We model the probability of downloading a polluted le
at time
t
,
p
b
(t)
, as being equal to the fraction of polluted
les. This probability is the same for a peer irrespective of
how many times it has been infected. This is a reasonable
approximation because the number of versions of an item is
anticipated to be much larger than the number of re-tries.
We model the expected behaviour of a large group of peers.
At time
t
, a fraction of the susceptible peers
¸
s
download a
le. This is effectively the download rate. A fraction
¸
r
of the
infected peers decide to retry and hence rejoin the susceptible
pool. A fraction
¸
x
of the infected peers choose to give up and
enter the recovered state. We make the simplifying assumption
that the download rate, and the rates of trying again and giving
up (
¸
r
and
¸
x
) do not vary over time. A constant value of
¸
s
produces the approximately exponential decay in the number
of downloads of an item as time elapses and its popularity
declines. It is reasonable to assume that the variation of the
rates of trying again or giving up do not change substantially
over time.
With these modelling choices, we arrive at the following
set of equations that describe the evolution of pollution in the
system.
decide to
retry
S
I
download
polluted file
download
good file
decide to
give up
R
Fig. 1. The transition diagram for peers indicating the actions that trigger
movement between the three classes of susceptible (S), infected (I) and
recovered (R)
IV. P2P P
OLLUTION
M
ODEL
We assume that
M
i
peers are interested in item
i
, and that
there are a multitude of versions of the item, classied as
“good” or “bad”. Initially the P2P network is seeded with
N
g
(0)
good les and
N
b
(0)
bad les. The peers who provided
these seed les do not number among the
M
i
peers we
consider in our model. We model the peers as belonging to
three classes:
Susceptible
,
Infected
, and
Recovered
.
S(t)
is the
number of susceptible peers at time
t
; this class includes all
peers who will attempt to download another version of the le
in the future. Initially
S(0) = M
i
, as all interested peers are
susceptible.
I(0) = 0
and
R(0) = 0
, because no les have
been downloaded from the seeds.
A peer transitions between the three states as depicted in
the transition diagram in Figure 1. Each peer is susceptible
when it intends to download a le. When a susceptible peer
downloads a le, it joins the Infected class if the le is bad
and the Recovered class if the le is good. A peer may leave
the Infected class by testing the downloaded le and electing
to retry at some stage in the future. In this case, the peer
rejoins the Susceptible class. Alternatively, an infected peer
may decide to give up and join the Recovered class, despite
not being successful in acquiring a good version of the item.
A peer may dwell in the infected state for some period of
time before choosing to give up or to retry. This represents
the period of time before an infected peer tests a downloaded
le.
Eventually all peers will belong to the Recovered class.
We label this class “recovered” primarily to highlight the
parallels with standard epidemiological models. In our model
the distinguishing feature of a recovered peer is that it is no
longer actively seeking the item of interest. Note that in our
model, any susceptible or infected peer may be sharing none
or several polluted les, but cannot be sharing a good le. A
recovered peer may share at most one good le and may share
several polluted les.
The number of good shared versions of the item varies over
time, as does the number of bad. When a peer transitions
N
b
(t)
N
b
(t) + N
g
(t)
p
b
(t) =
(15)
dS(t)
dt
=−¸
s
S(t) + ¸
r
I(t)
(16)
dI(t)
dt
= p
b
(t) ¸
s
I(t)−(¸
r
+ ¸
x
)I(t)
(17)
dR(t)
dt
= (1−p
b
(t))¸
s
S(t) + ¸
x
I(t)
(18)
dN
b
(t)
dt
= ¸
s
p
b
(t) p
sb
S(t)−(¸
r
+ ¸
x
) p
db
p
sb
I(t)
(19)
dN
g
(t)
dt
= ¸
s
(1−p
b
(t)) p
sg
S(t)
(20)
As with the P2P virus model, these equations are derived
under the assumption that all peers have common behaviour;
variability in individual behaviour means that this will not
be a completely accurate model of the system. In addition,
the model does not address any notion of memory in user
behaviour; it is probable that a peer’s downloading behaviour
would change substantially if it has already received several
bad versions of an item. In simulations in Section VII, we
account for variability in peer behaviour and a limited notion
of memory; our results indicate that the deterministic model
described above, despite its limitations and assumptions, pro-
vides a good indication of the evolution of the extent of
pollution in the P2P network (for a specic item).
[ Pobierz całość w formacie PDF ]
zanotowane.pl doc.pisz.pl pdf.pisz.pl upanicza.keep.pl
Epidemiological Modelling of Peer-to-Peer Viruses
and Pollution
Richard Thommes and Mark Coates
Department of Electrical and Computer Engineering
McGill University
3480 University St
Montreal, QC, Canada H3A 2A7
Email:
{
rthomm,coates
}
@tsp.ece.mcgill.ca
networks unusable.
Abstract
— The popularity of peer-to-peer (P2P) networks
makes them an attractive target to the creators of viruses and
other malicious code. Recently a number of viruses designed
specically to spread via P2P networks have emerged. Pollution
has also become increasingly prevalent as copyright holders inject
multiple decoy versions in order to impede item distribution. In
this paper we derive deterministic epidemiological models for
the propagation of a P2P virus through a P2P network and
the dissemination of pollution. We report on discrete simulations
that provide some verication that the models remain sufciently
accurate despite variations in individual peer conduct to provide
insight into the behaviour of the system. The paper examines
the steady-state behaviour and illustrates how the models may
be used to estimate in a computationally efcient manner how
effective object reputation schemes will be in mitigating the
impact of viruses and preventing the spread of pollution.
In this paper we examine the behaviour of viruses and
pollution in P2P networks. We adopt an epidemiological ap-
proach, developing dynamic models to describe the evolution
of infection/pollution. We consider the stochastic nature of
the system during our development of the models, but our
models are deterministic and focus on the expected behaviour
of the system. We illustrate that these deterministic models are
sufciently accurate to capture the behaviour of P2P networks,
by comparison with more realistic simulations that model
individual peers.
Our initial purpose is to model the impact of malicious code
on a P2P network, but a primary motivation is to examine how
effective the introduction of mitigation techniques might be.
In particular, we focus on
object reputation
schemes (such as
Credence [6]) and methods that increase the rate of elimination
of infected les. Our model provides an analytical method
for determining (at least approximately) how widespread the
adoption of such schemes must be, and how effective they
must be, in order that specic targets of residual pollution or
infection be achieved. We validate these specications through
more accurate simulation of the networks.
I. I
NTRODUCTION
Peer-to-peer (P2P) networks have become increasingly vul-
nerable to malicious behaviour, including the dissemination of
polluted
versions of les and the release of P2P viruses. Early
P2P networks such as Napster focussed exclusively on media
les, so propagation of viruses was difcult to achieve [1].
Contemporary P2P networks such as Kazaa / Fastrack [2]
and eDonkey2000 [3] can be used to disseminate executable
les and are hence much more susceptible, particularly as the
mainstream adoption of P2P le exchange—the eDonkey2000
network alone typically has over 2 million users connected at
any given time [4]—means that a signicant fraction of users
lack the technical knowledge to detect suspicious les or scan
for viruses.
The phenomenon of pollution, the presence of corrupted
(or “bad”) versions of items (songs, movies or multimedia
les) in P2P networks, has become increasingly prevalent.
Some of these versions are made available by accident, as
users make errors in le generation. But the dominant cause is
deliberate dissemination of decoy les, termed
item poisoning
in [5], a technological mechanism employed by copyright
holders and their agents to impede the distribution of content.
These decoy les have names and metadata matching those
of the genuine item, but contain corrupted, unreadable or
inappropriate data. Whether accidental or deliberate, pollution
has rendered a substantial portion of the les on popular P2P
The paper is structured as follows. In the remainder of the
introduction, we highlight the salient features of P2P networks,
viruses and pollution, and discuss related work. Section II
presents a model for the expected evolution of a virus in the
system. In Section III, we analyze the steady-state behaviour of
our P2P virus model. Section IV presents an epidemiological
model for the proliferation of pollution. Section V examines
the impact of object reputation schemes. Section VI reports
on an empirical study of the e-Donkey network, which we
conducted to identify suitable parameters for the examination
of our models. Section VII reports on discrete-time simulations
of the P2P network, which provide a validation that the
deterministic models capture the primary characteristics of
system evolution despite ignoring the variability in behaviour
of individual peers. Finally, Section VIII draws conclusions
based on our analysis and results.
A. Peer-to-peer networks, viruses and pollution
This section highlights the key features shared by pop-
ular P2P Networks, including Kazaa, eDonkey2000, and
Gnutella [7]. Every peer connected to the network has a
shared
folder
containing all the les the user wishes to make publicly
available for download by others on the network. When a
user wants to download a le, he begins by sending out a
search request. In response he receives a list of les matching
the search criteria. The specic manner in which this list is
generated varies among the various P2P networks, but in all
cases the query response is the result of the examination of
the shared folders of a subset of all peers connected to the
network. Once the user elects to download one of the les from
the list, his client attempts to set up a connection to a peer
sharing the le and begins receiving the le. Depending on
the specic network, the client may attempt to simultaneously
download different parts of the le from a number of peers
in order to expedite the operation. P2P clients typically save
new downloaded les in the shared folder – making them
immediately available to other users.
A number of worms and viruses that exploit P2P networks
have already surfaced. The majority of these behave in a
similar fashion. Specically, when a user downloads a le
containing the virus and executes it, a number of new les
containing the virus are created and placed in the client’s
shared directory. Some types of viruses, including Achar [8]
and Gotorm [9], generate a xed list of lenames when
executed. More advanced viruses, such as Bare [10] and
Krepper [11], randomly pick the list of lenames from a large
pool of candidates.
Pollution is a more widespread phenomenon, as indicated by
the empirical study performed in [12]. The study indicated that
the number of versions of relatively popular items is generally
substantial (on the order of tens or hundreds). It was also
observed that the pollution level (the fraction of bad versions)
for a specic item remained approximately constant over time.
and e-mail viruses, there are sufcient differences in their
spreading mechanics to necessitate the development of a new
model. The dynamic pollution model developed in [5] is
closely related to our epidemiological pollution model, and
produces similar behaviour. Phrasing the model in an epi-
demiological framework provides an alternative understanding
of system behaviour. The deterministic models are reason-
ably accurate even with substantial variation in individual
peer behaviour, and we illustrate how they can be used to
estimate in a computationally efcient manner the impact of
an object reputation scheme in mitigating P2P viruses and
pollution. Conversely, the models can be used to determine
how widespread the usage of a reputation scheme must be and
how much it must dampen the probability of downloading an
infected or polluted le in order to achieve a target level of
pollution/infection.
II. P2P V
IRUS
M
ODEL
The intent of our model is to predict the expected behaviour
of a virus which spreads through a P2P network in the form of
malicious code embedded in executable les shared by peers.
We make the simplifying assumption that all users download
les to their shared folder. We are not concerned with the
transfer of media les which cannot contain malicious code,
and do not model them. Note that we use the term
user
in
this paper to refer to a person using a P2P
client
program.
The term
peer
is used to collectively refer to a P2P client and
the user directing its behaviour.
This model classies all peers as falling into one of three
classes:
Susceptible, Exposed
, or
Infected
:
Susceptible
– Peers that are not sharing any infected les,
but are at risk of downloading infected les. The number of
peers in this category at time
t
is denoted by
S(t)
.
Exposed
– Peers that have downloaded one or more
infected les, but have not executed them. The number of
peers in this category at time
t
is denoted by
E(t)
. The
Exposed category is included in the model to allow for a
delay between download of an infected le and execution.
B. Related Work
The advent of mathematical Epidemiology – the eld of
biology which models how diseases spread in a population
– is generally credited to McKendrick and his seminal 1926
paper [13]. Previous work in applying epidemiology to model-
ing how computer viruses and other malware spreads between
machines dates back to the late 1980s/early 1990s [14], [15].
More recently, several authors have utilized epidemiological
models to study the spread of worms and e-mail viruses in
the Internet [16]–[20].
There have been a number of recent papers which model
le propagation in P2P networks [21]–[24]. Dumitriu et al. [5]
model the spread of polluted les in P2P networks, and Liang
et al. report on an empirical study of pollution in P2P networks
in [12]. The behaviour of object reputation mechanisms has
been discussed in [6].
Contribution
: We believe that our paper is the rst to
develop a epidemiological model for peer-to-peer viruses.
Although these viruses share similarities with Internet worms
Infected
– Peers that have executed an infected le. Upon
execution, a total of of
c
infected les reside in the peer’s
shared folder. The number of peers in this category at time
t
is denoted by
I(t)
.
An Infected client may be detected by the user, who will
then proceed to remove all the infected les, thereby returning
the state of the peer to Susceptible. At all times, every one of
the
N
peers making up the network falls into one of the three
categories. Thus, for all values of
t
,
N = S(t) + E(t) + I(t)
.
We assume that the total number of uninfected les in
the network is xed at
M
. The total number of infected
les at time
t
is given by
K(t)
. The expected proportion of
infected les in the network,
q(t)
, is therefore
q(t) =
K(t)
K(t)+M
.
Event Variables Affected
File downloaded
q(t)
,
S(t)
,
E(t)
File executed
q(t)
,
E(t)
,
I(t)
Peer recovers
q(t)
,
I(t)
,
R(t)
TABLE I
P2P V
IRUS MODEL VARIABLES THAT ARE POTENTIALLY AFFECTED BY
EACH POSSIBLE EVENT IN THE NETWORK
.
Exposed peer executes an infected le, the number of Infected
peers increases by one. Since les are executed in order of
download, the le executed by an Exposed peer will always be
the infected le which it had downloaded to become Exposed.
This occurs at a rate of
¸
E
E(t)
. Therefore,
dI(t)
dt
=−¸
R
I(t) + ¸
E
E(t)
(1)
Rate at which the number of Exposed peers changes
: The
rate at which the number of Exposed peers decreases due to
infection is given by the negative of the second term in (1). The
rate at which previously Susceptible peers become Exposed is
dependent on the aggregate rate at which they download les,
¸
S
S(t)
, multiplied by the probability that a downloaded le
is infected,
h(t)
. The overall rate is therefore:
When a user downloads a le, we assume the probability of
choosing an infected le will be dependent on the prevalence
of infected les in the network. The probability will vary
to some degree for different peers, according to whether the
peer has updated virus-detection software or is aware of the
common characteristics of virus les (such les are often
much smaller than genuine versions of the item). In our
model, we are interested in the average probability of choosing
an infected le, and we denote this probability by
h(t)
. In
Section III, where we examine steady-state behaviour, we set
h(t) = ®q(t)
, for some constant
®
, to reect the fact that the
probability is closely tied to virus prevalence and to simplify
our analysis.
There are three distinct events that may occur in the
network which affect one or more of the time-varying
variables described above. These events include a peer
downloading a le from another, a peer executing a shared
le, and an Infected peer recovering. Although individual
peers conduct these activities at (potentially very) different
rates, we develop our model based on average behaviour. Our
simulation results in Section VII indicate that this modelling
choice does not produce substantially erroneous behaviour.
The average rates at which each of these events occurs are
governed by three parameters:
dE(t)
dt
=−¸
E
E(t) + ¸
S
S(t)h(t)
(2)
Rate at which the number of Susceptible peers changes
:
Since
N
is xed, it always holds that
dS(t)
dt
+
dE(t)
dt
+
dI(t)
dt
= 0
.
dS(t)
dt
Therefore,
is the negative sum of (1) and
(2):
dS(t)
dt
=−¸
S
S(t)h(t) + ¸
R
I(t)
(3)
Rate at which the number of infected les in the network
changes
: There are three events which result in a change in
the number of infected les in the network: a peer downloads
an infected le, an Exposed peer becomes Infected, and an
Infected peer recovers. We assume that all downloaded les
are executed, and that a peer does not download any additional
les prior to executing the most recently downloaded le.
Peers cannot share more than one copy of a le with the
same name. If the number of unique infected lenames is
limited to
c
, only Susceptible peers can download infected
les. Exposed peers do not download any additional les
before becoming Infected, and Infected peers are sharing all
c
possible infected les. Thus, the rate of change due to
downloads is
S(t)¸
S
h(t)
.
An Exposed peer always has one infected le before be-
coming Infected, meaning in all cases
c−1
new infected les
are created when an Exposed peer becomes Infected. The rate
of change is thus
E(t)¸
S
(c−1)
.
An Infected peer will always share
c
les, so a recovery
results in a reduction of
c
infected les. The rate is therefore
−I(t)¸
R
c
. The overall rate of change of
K
is therefore:
dK(t)
dt
¸
S
: Average rate, in les per minute, at which each peer
downloads new les (this includes time spent searching and
setting up the connection to another peer).
¸
E
: Average rate, in les per minute, at which each peer
executes shared les. We assume that a peer executes les in
the order in which they are downloaded.
¸
R
: Average rate, in “recoveries per minute”, at which
Infected peers recover. A recovery occurs when all infected
les are removed, returning the peer state to Susceptible.
= S(t)¸
S
h(t) + E(t)¸
E
(c−1)−I(t)¸
R
c
(4)
A. Model Equations
Table I summarizes which time-varying variables are af-
fected by each of the three events that may occur in the
network. The state progression for all peers in our model is
S!E!I!S...
. We now derive the differential equations
that govern the evolution of our P2P model.
Rate at which the number of Infected peers changes
: When
an Infected peer recovers, the number of Infected peers de-
creases by one. Recoveries occur at rate
¸
R
I(t)
. When an
We note that if the names of generated les are chosen from
a pool of names much larger than
c
, Infected peers can con-
tinue to download infected les and the above equation does
not hold. The model and analysis in this case becomes more
involved. See [25] for a discussion on this and other variations
of the model, including cases where not all downloaded les
are executed and where multiple downloads are possible prior
to execution.
h(t) = f(q(t))
. Dening
E
,
I
,
S
, as the steady-state values of,
respectively,
E(t)
,
I(t)
, and
S(t)
, Equation (1) implies that:
B. Model Extensions
1) Modeling On-line/Off-line Behaviour:
In a real P2P net-
work, individual peers are only on-line for limited durations.
In order to capture this behavior, we present an extension
of our model that includes both on-line and off-line users.
Each of the three variables specifying how many peers are in
each category –
S, E, I
– is partitioned into two variables to
account for how many peers in the category are on and off-
line. So, for instance,
I(t) = I
N
(t) + I
F
(t)
, where
I
N
(t)
is
the number of Infected peers on-line, and
I
F
(t)
is the number
of Infected peers ofine. Peers that are off-line go on-line at
a certain rate
¸
N
, and on-line peers go off-line at rate
¸
F
.
The differential equation governing the change in the number
of on-line Infected peers at time
t
is:
I = E
¸
E
¸
R
(6)
If we dene
¿
and
µ
as, respectively, the expected number
of infected les each Exposed and Infected peer is sharing in
steady-state, then
q
, the proportion of infected les in steady-
state may be expressed as:
E¿ + Iµ
M + E¿ + Iµ
q =
(7)
Substituting (6) into (7) provides:
dI
N
(t)
dt
E(¿ ¸
R
+ µ¸
E
)
M¸
R
+
= I
F
(t)¸
N
−I
N
(t)¸
F
(5)
q =
(8)
E(¿ ¸
R
+ µ¸
E
)
The equations governing the rates of change in
S
N
(t)
and
E
N
(t)
are analogous. We assume here that peers go on and
off-line at the same rate regardless of their state. It would also
be simple to expand the model to include different rates for
each state.
To complete the specication of the extended model, all
the previously dened differential equations are changed as
follows: every instance of
S(t)
,
E(t)
, and
I(t)
is replaced,
respectively, by
S
N
(t)
,
E
N
(t)
, and
I
N
(t)
.
If
f(q) > 0
, equation (2) implies that, in steady state:
¸
E
¸
S
f(q)
S =
E
(9)
S = N−I−E
, equation (6) can be utilized to express
Since
N
as:
S = N−E(1 +
¸
E
¸
R
)
(10)
2) Modeling Peers that Remain Infected:
One can argue
that a certain proportion of P2P users, when their client
becomes Infected, will never detect that this has occurred and
not take any action to remedy this problem. In order to include
this behaviour in our model, we classify all peers as “aware”
or “oblivious”. Aware peers behave as those in our basic model
described in II-A, while oblivious peers progress
S!I
and
then remain Infected. The number of peers in each group is
xed:
N = N
A
+N
O
where
N
A
is the number of aware peers,
and
N
O
is the number of oblivious peers.
As in Section II-B.1, the number of peers falling into each
of the four categories at time
t
is partitioned into two groups;
in this case the number of aware users in category
X
at time
t
where
X2{S, E, I}
, is denoted by
X
A
(t)
and the number
of oblivious users in each category is denoted by
X
O
(t)
. The
behaviour of aware users is determined by equations (1), (2),
and (3), with
X
A
(t)
replacing
X(t)
for all
X2{S, E, I}
.
Oblivious users are governed by (1), (2), and (3), with
X
0
(t)
replacing
X(t)
, and
¸
R
set to zero (reecting the fact that
oblivious peers never recover). Finally,
dK(t)
If
h(t)
is proportional to
q(t)
,
h(t) = ®q(t)
, we can obtain
a closed-form expression for
E
by substituting (8) into (9),
equating with (10), and solving for
E
:
¸
R
®(N¸
S
(µ¸
E
+ ¿ ¸
R
)−M¸
E
¸
R
)
(¿ ¸
R
+ µ¸
E
)(¸
S
®(¸
R
+ ¸
E
) + ¸
E
¸
R
)
E =
; q > 0
(11)
I
follows trivially from (11) and (6):
The expression for
¸
E
®(N¸
S
(µ¸
E
+ ¿ ¸
R
)−M¸
E
¸
R
)
(¿ ¸
R
+ µ¸
E
)(¸
S
®(¸
R
+ ¸
E
) + ¸
E
¸
R
)
I =
; q > 0
(12)
If
q = 0
, it follows from (7) that
E = I = 0
. It is of interest
to consider Equation (12) as it approaches 0. In the limiting
case, approached from above, we have the equality
N¸
S
(µ¸
E
+ ¿ ¸
R
) = M¸
E
¸
R
(13)
Since we assume that all downloaded les are eventually
executed, it follows that it is reasonable to equate the rates
of download and execution,
¸
E
= ¸
S
. Under this assump-
tion, (13) provides the following minimum average recover
rate,
¸
mi
R
in order for all infected les to eventually be
removed from a P2P network:
dt
is governed by
a modied version of (4), with
S(t)
replaced by
S
A
(t)+S
O
(t)
,
E(t)
replaced by
E
A
(t) + E
O
(t)
, and
I(t)
replaced by
I
A
(t)
.
III. A
NALYSIS
- S
TABILITY
R
ESULTS
If the P2P network reaches a steady-state equilibrium by
some time
t = T
, then
Nµ¸
E
M−N¿
¸
min
R
=
; M > N¿
(14)
dE(T )
dt
dI(T )
dt
=
dS(T )
dt
= 0
. In
this section, we assume that the probability of downloading an
infected le is a function of the proportion of infected les, i.e.,
=
This equation indicates that, if
h(t) = ®q(t)
, then
¸
min
R
is
a linearly increasing function of
¸
E
.
from the susceptible to recovered state by downloading a
good version, it shares the le with probability
p
sg
. When
a peer transitions from the susceptible to infected state by
downloading a bad le, it shares the le with probability
p
sb
.
When a peer transitions from the infected to susceptible state
or recovered state, it removes the polluted le with probability
p
db
. We model the probability of downloading a polluted le
at time
t
,
p
b
(t)
, as being equal to the fraction of polluted
les. This probability is the same for a peer irrespective of
how many times it has been infected. This is a reasonable
approximation because the number of versions of an item is
anticipated to be much larger than the number of re-tries.
We model the expected behaviour of a large group of peers.
At time
t
, a fraction of the susceptible peers
¸
s
download a
le. This is effectively the download rate. A fraction
¸
r
of the
infected peers decide to retry and hence rejoin the susceptible
pool. A fraction
¸
x
of the infected peers choose to give up and
enter the recovered state. We make the simplifying assumption
that the download rate, and the rates of trying again and giving
up (
¸
r
and
¸
x
) do not vary over time. A constant value of
¸
s
produces the approximately exponential decay in the number
of downloads of an item as time elapses and its popularity
declines. It is reasonable to assume that the variation of the
rates of trying again or giving up do not change substantially
over time.
With these modelling choices, we arrive at the following
set of equations that describe the evolution of pollution in the
system.
decide to
retry
S
I
download
polluted file
download
good file
decide to
give up
R
Fig. 1. The transition diagram for peers indicating the actions that trigger
movement between the three classes of susceptible (S), infected (I) and
recovered (R)
IV. P2P P
OLLUTION
M
ODEL
We assume that
M
i
peers are interested in item
i
, and that
there are a multitude of versions of the item, classied as
“good” or “bad”. Initially the P2P network is seeded with
N
g
(0)
good les and
N
b
(0)
bad les. The peers who provided
these seed les do not number among the
M
i
peers we
consider in our model. We model the peers as belonging to
three classes:
Susceptible
,
Infected
, and
Recovered
.
S(t)
is the
number of susceptible peers at time
t
; this class includes all
peers who will attempt to download another version of the le
in the future. Initially
S(0) = M
i
, as all interested peers are
susceptible.
I(0) = 0
and
R(0) = 0
, because no les have
been downloaded from the seeds.
A peer transitions between the three states as depicted in
the transition diagram in Figure 1. Each peer is susceptible
when it intends to download a le. When a susceptible peer
downloads a le, it joins the Infected class if the le is bad
and the Recovered class if the le is good. A peer may leave
the Infected class by testing the downloaded le and electing
to retry at some stage in the future. In this case, the peer
rejoins the Susceptible class. Alternatively, an infected peer
may decide to give up and join the Recovered class, despite
not being successful in acquiring a good version of the item.
A peer may dwell in the infected state for some period of
time before choosing to give up or to retry. This represents
the period of time before an infected peer tests a downloaded
le.
Eventually all peers will belong to the Recovered class.
We label this class “recovered” primarily to highlight the
parallels with standard epidemiological models. In our model
the distinguishing feature of a recovered peer is that it is no
longer actively seeking the item of interest. Note that in our
model, any susceptible or infected peer may be sharing none
or several polluted les, but cannot be sharing a good le. A
recovered peer may share at most one good le and may share
several polluted les.
The number of good shared versions of the item varies over
time, as does the number of bad. When a peer transitions
N
b
(t)
N
b
(t) + N
g
(t)
p
b
(t) =
(15)
dS(t)
dt
=−¸
s
S(t) + ¸
r
I(t)
(16)
dI(t)
dt
= p
b
(t) ¸
s
I(t)−(¸
r
+ ¸
x
)I(t)
(17)
dR(t)
dt
= (1−p
b
(t))¸
s
S(t) + ¸
x
I(t)
(18)
dN
b
(t)
dt
= ¸
s
p
b
(t) p
sb
S(t)−(¸
r
+ ¸
x
) p
db
p
sb
I(t)
(19)
dN
g
(t)
dt
= ¸
s
(1−p
b
(t)) p
sg
S(t)
(20)
As with the P2P virus model, these equations are derived
under the assumption that all peers have common behaviour;
variability in individual behaviour means that this will not
be a completely accurate model of the system. In addition,
the model does not address any notion of memory in user
behaviour; it is probable that a peer’s downloading behaviour
would change substantially if it has already received several
bad versions of an item. In simulations in Section VII, we
account for variability in peer behaviour and a limited notion
of memory; our results indicate that the deterministic model
described above, despite its limitations and assumptions, pro-
vides a good indication of the evolution of the extent of
pollution in the P2P network (for a specic item).
[ Pobierz całość w formacie PDF ]