Email Virus Propagation Modeling and Analysis
Cliff C. Zou, Don Towsley, Weibo Gong
Department of Electrical & Computer Engineering
Department of Computer Science
Univ. Massachusetts, Amherst
Technical Report: TR-CSE-03-04
Abstract
Email viruses constitute one of the major Internet security problems. In this paper we present an
email virus model that accounts for the behaviors of email users, such as email checking frequency and
the probability of opening an email attachment. Email viruses spread over a logical network defined by
email address books. The topology of the email network plays an important role in determining how
an email virus spreads. Our observations suggest that the node degrees in an email network follow a
heavy-tailed distribution, and we model the email network as a power law network. We compare email virus propagation
on three topologies: power law, small world and random graph topologies. The impact of the power law
topology on the spread of email viruses is mixed: email viruses spread more quickly than on a small
world or a random graph topology but immunization defense against viruses is more effective on a power
law topology.
Methods keywords: Simulations, Graph theory, Statistics.
I. INTRODUCTION
Computer viruses have been studied for a long time both by the research and by the application
communities. Cohen’s work [13] formed the theoretical basis for this field. In the early 1980s, viruses
mainly spread through the exchange of floppy disks. At that time, only a small number of computer
viruses existed and virus infection was usually restricted to a local area. As computer networks and the
Internet became more popular from the late 1980s on, viruses quickly evolved to be able to spread through
the Internet by various means such as file downloading, email, exploiting security holes in software, etc.
Currently, email viruses constitute one of the major Internet security problems. For example, the
Melissa virus in 1999, “Love Letter” in 2000 and “W32/Sircam” in 2001 spread widely throughout the
Internet and caused millions or even billions of dollars in damage [19][20][22]. There is, however, no
formal definition of “email virus” in the virus research area: any computer program can be called an
email virus as long as it can replicate and propagate by sending copies of itself through email messages.
While Melissa is an email virus that only uses email to propagate [21], most email viruses can also use
other mechanisms to propagate in order to increase their spreading speed on the Internet. For example,
“W32/Sircam” can spread through unprotected network shares, the shared resources that others can
access through the network [23]; “Love Letter” can propagate through Internet Relay Chat (IRC) or network
shares [25]; Nimda can use four other mechanisms besides email to propagate [27].
Though virus spreading through email is an old technique, it is still effective and is widely used by
current viruses and worms. Sending viruses through email has some advantages that are attractive to
virus writers:
• Sending viruses through email does not require any security holes in computer operating systems
or software.
• Almost everyone who uses computers uses email.
• A large number of users have little knowledge of email viruses and trust most email they receive,
especially email from their friends [28][29].
• Email is private property, like postal letters. Thus corresponding laws or policies are required
to permit checking email content for viruses before end users receive it [18].
In order to understand how viruses propagate through email, in this paper we focus exclusively on
those email viruses that propagate solely through email, such as the Melissa virus [21] (if we overlook its
slow spreading via file exchange). Thus “email virus” as used in this paper is defined as a virus that only
spreads through email by including a copy of itself in the email attachment: an email user will be
infected only if he/she opens the virus email attachment. If the email user opens the attachment, the virus
program will infect the user’s computer and send itself as an attachment to all email addresses in the
user’s email address book.
A. Prior and related work
Considerable research has focused on detection and defense against email viruses. Anti-virus software
companies continuously add new techniques into their products and provide email virus defense software
such as SMTP gateway anti-virus systems [17]. However, little research has been done on modeling the
propagation of viruses and worms, let alone the propagation of email viruses.
Kephart, White and Chess of IBM published a series of papers from 1991 to 1993 on viral infection
based on epidemiology models [6][7][8]. [6][7] were based on a birth-death model in which viruses
were spread via activities mostly confined to local interactions. They further improved their model by
adding the “Kill signal” process and also considered the special model of viral spread in organizations
[8]. Though at that time the assumption of local interaction was accurate because of sharing disks, today
it’s no longer valid when most viruses and worms propagate through the Internet. In 2000 Wang et
al. studied a simple virus propagation model based on a clustered topology and a tree-like hierarchic
topology [9]. In their model, copies of the virus would activate at a constant rate without accounting
for any user interactions. The lack of a user model, coupled with the clustered and tree-like topologies,
makes it unsuitable for modeling the propagation of email viruses over the Internet. Recently, Staniford et
al. studied Code Red worm propagation and presented several new techniques to improve the spreading
speed of worms [10]. The worm model considered in their paper assumes that a worm can directly reach
and infect any other computer, which is suitable for worms but not for email viruses: email
viruses must pass through an email network hop-by-hop.
Some researchers have studied immunization defense against virus propagation. Immunization means
that some nodes in a network are immunized and cannot be infected by the virus or worm. Wang et al.
showed that selective immunization can significantly slow down virus propagation for tree-like hierarchic
topology [9]. From an email virus point of view, the connectivity of a partly immunized email network is
a percolation problem. Newman et al. derived the analytical solution of the percolation threshold of small
world topology [15][16]: if nodes are removed randomly from a small world network and the fraction of
these nodes is higher than the percolation threshold, the network will be broken into pieces. Albert et al.
were the first to explain the vulnerability of power law networks under attacks: by selectively attacking
the most connected nodes, a power law network tends to be broken into many isolated fragments [4].
The authors concluded that the power law topology was vulnerable under deliberate attack.
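The effect of selectively removing the most connected nodes can be illustrated with a minimal sketch. The function names, the toy hub-and-spoke graph, and the use of the largest connected component as a fragmentation measure are all assumptions made for illustration; they are not taken from the cited work.

```python
from collections import deque


def largest_component_size(adj, removed):
    """Size of the largest connected component among nodes not in `removed`."""
    seen = set(removed)          # removed (immunized) nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:             # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, size)
    return best


def most_connected_nodes(adj, fraction):
    """Return the top `fraction` of nodes ranked by degree (hypothetical helper)."""
    count = int(len(adj) * fraction)
    ranked = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    return set(ranked[:count])


# Toy hub-and-spoke network: node 0 is linked to nodes 1..9.
star = {0: set(range(1, 10))}
for leaf in range(1, 10):
    star[leaf] = {0}

hub = most_connected_nodes(star, 0.1)             # selects only the hub
after_hub = largest_component_size(star, hub)     # every leaf is isolated
after_leaf = largest_component_size(star, {5})    # one leaf removed instead
```

In this toy case, removing the single hub isolates every remaining node, while removing one leaf leaves the other nine nodes connected.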
B. Our contributions
We present an email virus model that accounts for the behaviors of email users, such as email checking
frequency and the probability of opening an email attachment.
Our observations show that the size of email groups follows a heavy-tailed distribution. Since the email
network contains email groups, we believe the node degrees of an email network also follow a heavy-tailed
distribution, and we model the email network as a power law network.
We carry out extensive simulation studies of email virus propagation. From these experiments we
derive a better understanding of the dynamics of an email virus propagation, how the degree of initially
infected nodes affects virus propagation, how the network parameters such as the power law exponent
affect virus behavior, etc.
For simplified email virus models, we mathematically prove that an email virus propagates faster as the
email checking time becomes more variable although the average email checking time does not change.
We gain a better understanding of the differences among power law, small world and random graph
topologies by simulating email virus propagation on them.
We derive by simulations the selective percolation curves and thresholds for the power law, small world
and random graph topologies. These selective percolation curves can explain why selective immunization
defense against virus spreading is quite effective for a power law topology but not so good for the other
two topologies.
C. Organization of the paper
The rest of the paper is organized as follows: We present the email virus model in Section II. In
Section III we discuss the email network topology and model it as a power law topology. In Section
IV, we present simulation studies of email virus propagation without considering immunization. We also
compare virus spreading among power law, small world and random graph topologies. In Section V, we
study the immunization defense against email viruses and the corresponding percolation problem. Section
VI concludes this paper with some discussions.
II. EMAIL VIRUS PROPAGATION MODEL
Because of the complexity of an email network and the randomness of email users’ behaviors, it’s
difficult to mathematically analyze email virus propagation. Thus in this paper we will rely primarily on
simulation rather than mathematical analysis. In this way we can focus on realistic scenarios of email
virus propagation.
In this paper, we consider email viruses that only spread through users’ email address books. Thus
the address relationships among users’ address books form a logical network for email viruses. Strictly
speaking, the email logical network is a directed graph: each vertex in the graph represents an email
user, while a directed edge from node A to node B means that user B’s email address is in user A’s
address book. A user’s email address book usually contains the email addresses of the user’s friends or
business partners. Thus if user A has user B’s address, user B probably also has user A’s address in his
own address book, which means that many of the directed edges in the email network point in both
directions. Although this may not always be true, we model the email network as an undirected graph in
this paper.
We represent the topology of the logical email network by an undirected graph $G = \langle V, E \rangle$.
Each $v \in V$ denotes an email user, and each $e = (u, v) \in E$, $u, v \in V$, represents two users $u$
and $v$ that have each other’s email address in their own address books. $|V|$ is the total number of email users.
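As a minimal sketch of this representation, the following builds an undirected adjacency structure from per-user address books. The function name, the dictionary layout, and the sample users are hypothetical; treating every directed address-book entry as an undirected edge is the simplification adopted in the text.

```python
def build_email_graph(address_books):
    """Map each user to the set of users linked to them by an address-book entry.

    `address_books` maps a user to the addresses that user stores. Every
    directed entry (A stores B) is treated as an undirected edge {A, B}.
    """
    adj = {user: set() for user in address_books}
    for user, contacts in address_books.items():
        for contact in contacts:
            if contact == user:
                continue                      # ignore a user's own address
            adj[user].add(contact)
            adj.setdefault(contact, set()).add(user)
    return adj


# Hypothetical sample data: carol stores no addresses but appears in alice's book.
books = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": []}
graph = build_email_graph(books)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
```

The size of each set in `graph` is the degree of that user in $G$, and `len(graph)` plays the role of $|V|$.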
Let us first describe the email virus propagation scenario captured by our model. Users check their email
from time to time. When a user checks his email and encounters a message with a virus attachment, he
may discard the message (if he suspects the email or detects the email virus by
using anti-virus software) or open the virus attachment if unaware of it. When the virus attachment is
opened, the virus immediately infects the user and sends out virus email to all email addresses in this
user’s email address book. The infected user will not send out virus email again unless the user receives
another copy of the virus email and opens the attachment again.
From the above description, we see that email viruses, unlike worms, depend on email users’
interaction to propagate. There are primarily two human behaviors affecting email viruses: one is the
email checking time, denoted by $T_i$, $i = 1, 2, \cdots, |V|$, the time interval between two consecutive
email checks by user $i$; the other is the opening probability, denoted by $P_i$, $i = 1, 2, \cdots, |V|$,
the probability with which user $i$ opens a virus attachment.
The email checking time $T_i$ of user $i$, $i = 1, 2, \cdots, |V|$, is a random variable with an average email
checking time, $E[T_i]$, determined by user $i$’s habits. We assume that when a user checks his email, he
checks all new email in his mailbox. The opening probability $P_i$ of user $i$ is determined by the user’s
awareness and knowledge of email viruses.
We assume that each user’s behaviors are independent of each other. We model $T_i$ and $P_i$,
$i = 1, 2, \cdots, |V|$, as follows:
• Email checking time $T_i$ of user $i$, $i = 1, 2, \cdots, |V|$, is exponentially distributed with the mean
$E[T_i]$. We assume that $E[T_i]$, $i = 1, 2, \cdots, |V|$, is itself a random variable, which we denote as $T$.
• User $i$ opens a virus attachment with probability $P_i$ when he checks any virus email. Let $P$ denote
the random variable that generates $P_i$, $i = 1, 2, \cdots, |V|$.
Since the number of email users, $|V|$, is very large and a user’s behaviors are independent of
others, we assume that $T$ and $P$ are independent Gaussian random variables, i.e.,
$T \sim N(\mu_T, \sigma_T^2)$, $P \sim N(\mu_P, \sigma_P^2)$.
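The user behavior model can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the numeric parameters are placeholders, and the handling of out-of-range Gaussian draws (redrawing non-positive mean checking times, clipping opening probabilities to [0, 1]) is a choice of this sketch and is not specified in the text.

```python
import random


def sample_user_behavior(num_users, mu_t, sigma_t, mu_p, sigma_p, rng):
    """Draw (E[T_i], P_i) for each user from independent Gaussians."""
    users = []
    for _ in range(num_users):
        mean_check = rng.gauss(mu_t, sigma_t)
        while mean_check <= 0:               # assumption: redraw invalid means
            mean_check = rng.gauss(mu_t, sigma_t)
        # Assumption: clip the opening probability into [0, 1].
        open_prob = min(1.0, max(0.0, rng.gauss(mu_p, sigma_p)))
        users.append((mean_check, open_prob))
    return users


def next_check_interval(mean_check, rng):
    """Draw one checking interval T_i, exponential with mean E[T_i]."""
    return rng.expovariate(1.0 / mean_check)


rng = random.Random(1)
# Placeholder parameters chosen only for illustration.
users = sample_user_behavior(1000, 40.0, 20.0, 0.5, 0.3, rng)
interval = next_check_interval(users[0][0], rng)
```

Note that `random.expovariate` takes the rate, so the mean $E[T_i]$ is passed as its reciprocal.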
An email user is called infected once the user opens a virus email attachment. Let $N_0$ denote the
number of initially infected users that send out virus email to all their neighbors at the beginning of a
virus propagation. Let random variable $N_t$ denote the number of infected users at time $t$ during email
virus propagation, $N_0 \le N_t \le |V|$, $\forall t > 0$.
It takes time before a recipient receives a virus email sent out by an infected user. But the email
transmission time is usually much smaller than a user’s email checking time (the time interval
between a user’s two consecutive email checks). Thus in our model we ignore the email transmission
time.
Table I lists most of the notations used in this paper.
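The propagation scenario described in this section can be sketched as a small event-driven simulation. This is a minimal illustration and not the authors' simulator: the function name, the data layout, and the two-user example are assumptions. Email delivery is instantaneous, as in the model, and a user who opens another copy of the virus sends it out again.

```python
import heapq
import random


def simulate_email_virus(adj, mean_check, open_prob, initially_infected,
                         end_time, rng):
    """Return a list of (time, number of infected users) points."""
    infected = set(initially_infected)
    mailbox = {user: 0 for user in adj}        # unread virus emails per user
    for user in infected:                      # initial send at time 0
        for nbr in adj[user]:
            mailbox[nbr] += 1
    # Schedule every user's first email check.
    events = [(rng.expovariate(1.0 / mean_check[user]), user) for user in adj]
    heapq.heapify(events)
    history = [(0.0, len(infected))]
    while events:
        now, user = heapq.heappop(events)
        if now > end_time:
            break
        unread, mailbox[user] = mailbox[user], 0   # user reads all new email
        for _ in range(unread):
            if rng.random() < open_prob[user]:     # attachment is opened
                if user not in infected:
                    infected.add(user)
                    history.append((now, len(infected)))
                for nbr in adj[user]:              # virus mails itself out
                    mailbox[nbr] += 1
        next_check = now + rng.expovariate(1.0 / mean_check[user])
        heapq.heappush(events, (next_check, user))
    return history


# Two users who hold each other's address; user 0 starts infected.
pair = {0: {1}, 1: {0}}
checks = {0: 1.0, 1: 1.0}
always = simulate_email_virus(pair, checks, {0: 1.0, 1: 1.0}, [0], 50.0,
                              random.Random(0))
never = simulate_email_virus(pair, checks, {0: 0.0, 1: 0.0}, [0], 50.0,
                             random.Random(0))
```

Because every opened copy triggers a new round of sending, the number of virus emails in the system can grow quickly on well-connected graphs, so `end_time` should be kept modest when experimenting with this sketch.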
III. EMAIL NETWORK TOPOLOGY DISCUSSION
The email network is determined by users’ email address books. The size of a user’s email address
book is the degree of the corresponding node in the network graph. Since email address books are private
property, we have no data that directly reveal the email topology. We have, however, examined the
sizes of more than 800,000 email groups in Yahoo! [11]. We can use these data to infer what
the topology might look like, although the topology of email groups is not the complete email network
topology.
TABLE I
NOTATIONS USED IN THIS PAPER
Notation | Explanation
$G = \langle V, E \rangle$ | Undirected graph representing the email network. $v \in V$ denotes an email user, $|V|$ is the user population.
$T_i$ | Email checking time of user $i$: the time interval between user $i$’s two consecutive email checks, $i = 1, 2, \cdots, |V|$. $T_i$ is exponentially distributed with mean value $E[T_i]$.
$E[T_i]$ | Average email checking time of user $i$, $i = 1, 2, \cdots, |V|$.
$P_i$ | Opening probability of user $i$: the probability with which user $i$ opens a virus attachment, $i = 1, 2, \cdots, |V|$.
$T$ | Gaussian-distributed random variable that generates $E[T_i]$, $i = 1, 2, \cdots, |V|$. $T \sim N(\mu_T, \sigma_T^2)$.
$P$ | Gaussian-distributed random variable that generates $P_i$, $i = 1, 2, \cdots, |V|$. $P \sim N(\mu_P, \sigma_P^2)$.
$N_0$ | Number of initially infected users at the beginning of virus propagation.
$N_t$ | Number of infected users at time $t$, $\forall t > 0$.
$E[N_t]$ | Average number of infected users at time $t$, $\forall t > 0$.
$V_t$ | Number of virus email in the system at time $t$, $\forall t > 0$.
$\alpha$ | Power law exponent of a power law topology that has complementary cumulative degree distribution $F(d) \propto d^{-\alpha}$.
$N$ | Number of users that are not infected when virus propagation is over.
$D_t$ | Average degree of nodes that are healthy before time $t$ but are infected at time $t$, $\forall t > 0$.
$C(p)$ | Connection ratio: the percentage of remaining nodes still connected after removal of the top $p$ percent of most connected nodes from a network.
$L(p)$ | Remaining link ratio: fraction of links remaining after removal of the top $p$ percent of most connected nodes.
As mentioned in Section II, we model the email network as an undirected graph $G = \langle V, E \rangle$. Let
$f(d)$ be the fraction of nodes with degree $d$ in $G$. The complementary cumulative distribution function
(ccdf) is denoted by $F(d) = \sum_{i=d}^{\infty} f(i)$, i.e., the fraction of nodes with degree greater than
or equal to $d$. We present the empirical ccdf of the Yahoo! group sizes for May 2002 in log-log format in Fig. 1.
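The empirical ccdf defined above can be computed directly from a list of observed sizes or degrees. The sketch below is illustrative only: the function name is hypothetical and the sample sizes are made-up toy values, not the Yahoo! data.

```python
from collections import Counter


def empirical_ccdf(values):
    """Return {d: F(d)}, where F(d) is the fraction of values >= d."""
    total = len(values)
    counts = Counter(values)
    ccdf = {}
    remaining = total                  # number of values >= current d
    for d in sorted(counts):
        ccdf[d] = remaining / total
        remaining -= counts[d]
    return ccdf


# Toy group sizes, invented for illustration.
sizes = [4, 4, 5, 7, 7, 7, 20, 100]
ccdf = empirical_ccdf(sizes)
# ccdf[4] is 1.0, ccdf[7] is 0.625, ccdf[100] is 0.125
```

Plotting the pairs $(d, F(d))$ on log-log axes gives a curve of the kind shown in Fig. 1; a roughly straight line on such a plot is consistent with a power law tail.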
Fig. 1. Complementary cumulative distribution of Yahoo! group size (May 2002); log-log axes, horizontal axis: group size (d).
Fig. 2. Illustration of a two-dimensional small world network
The size of Yahoo! groups varies from as low as 4 to more than 100,000. From Fig. 1 we can see that
the size of Yahoo! groups follows a heavy-tailed distribution, i.e., the ccdf $F(d)$ decays slower than
exponentially [12].
Email lists, also called email groups, have become very popular. Once a user puts the address
of an email list in his address book, from the virus point of view, the address book virtually contains all the
addresses in the email list. Since email group sizes follow a heavy-tailed distribution, as shown in Fig. 1,
it is reasonable to believe that the node degrees of the email network also follow a heavy-tailed distribution.