8   7   Computing PageRank on Big Graphs 10 18 Advanced
Articles,  Blog

8 7 Computing PageRank on Big Graphs 10 18 Advanced

So this is really the,
the implementation of how we would do this if everything fits in memory, meaning
if we can store Graph G memory, and if we can store the parameter, beta and
the arrange vector R in memory. Okay. So now of course what I
have to tell you is I, I said that, the, we can really take,
take advantage of matrix M being sparse. So what this means is that we want
to encode the sparse matrix M using only nonzero entries. And you kind of don’t want to store
on the disk the zero entries. So the way we do this,
is to perceive the following. Is the idea is, basically for
every source node, we restore it’s degree. And then,
just the nodes that this node points to. So in this sense, basically the amount
of storage we need is proportional to the number of in our graph. And not to the number of notes, right? Which means we can have a huge number of
notes, but all you really need to store, is we need to store the height,
the lengths or our graph. And we don’t need to store or pay any cost to the links that
do not exist in our graph. Since the real graphs are sparse, meaning that most of the links
in the network don’t exist,. this, this way we save a lot of space,
right? So for example, if we are working
with our one billion node graph and imagine that we have 10 billion, then this
means we will need around 40 gigabytes to, to store, to store the,
the links of the graph. And ten, ten gigabytes is, sorry,
40 gigabytes is very little in a sense. So, so far we assume that
everything fits into memory. What, for example, if matrix M, if our graph is so big that it doesn’t
fit into a memory of a single machine? Actually, it turns out that even with,
n matrix not fitting into memory, we can still quite efficiently
implement the page link. So let’s make first
the following assumption. Let’s assume that what we can
fit in memory is the vector r. So M is too big to be fit in memory. But r is small enough. So then actually everything is very,
very easy and very simple. Right.
So the idea is that we will have the, r old stored on the disk. And we will have the matrix
M stored on the disk. And then the way we will proceed
is basically to do one step of power-iteration. One step of our PageRank method. What we will do, is we will basically, do the step where we will read from
the memory a given, a given node. Its out nodes will have the vector,
r in memory and we can update its, its components. Right, so the idea basically is that
we keep the new vector r in memory. We, we keep scanning through the matrix M,
row by row. We load in a given node and
all the node it point, points to. We load its corresponding page rank score. Do the multiplication. Basically spread this page rank score of
node, of our node i into the vector r new. And again basically what this
means is that to do one, one iteration of this we need to
read matrix r m one time from disc. Right? So the way we can think about
this is the following, right? So at every step what we need to do is
we need to scan the matrix M into the, into the, into the memory, and we have to scan the rank,
the rank factor r old into memory. So this means that, that, once we,
one, once we have done this, we have also re, we have to ride the,
our new back to the, back to the disk so that the new iteration of our
PageRank algorithm, can start. So, if you think about the amount of
input output operations we have to do. The, the amount of input output we have
to do is 2 times the size of r plus M, right, so th e size of M,
the number of edges. So 2 times r is twice t
he number of nodes, and size of the M is the number
of edges of the network. Right, so in some sense,
every initial, every iteration of our, our algorithm,
we need to scan once through the matrix M. Now, the question is,
can we scale up this even further, right? Could we, could we take this algorithm and
scale it even further in the sense that, what if, what we have so many nodes, not,
not even our vector R fits into memory. Can we still, can we still do something? So now we will look how
to scale even further up. The idea here, is that we will take
our vector R and split it into blocks. And now I will just explain
the basic idea and then we will refine this basic idea into something
that really works well in practice. So here’s how we can think of the problem,
right? Before we assume that our vector our mew,
fits in memory. Now we are assuming that the only
part of rnew fits into memory. So what this means is that we can
now compute in one iteration of, of pagerank, of the power method. What we, what we do is we load
a part of our rnew into memory, scan through the matrix m,
scan through the, scan through the mat,
rank vector r, and compute. The correct pagening scores for a,
for a given,um, part of the vector R. And then we would, load in the next
part of vector R, scan again through the matrix, and scan again through
the R old, and update the values here. And then we, we would load the next block,
again scan through M, scan, scan through R, and so on. And this way, we would basically be
able to do, compute the, all the values of R new, in, in some number of passes,
over n and over the vector r. however, let’s think about how well,
how well this works. Right? So this, the way we think about this is, this is very similar to what in databases
we call a, nested-loop, loop join. So the idea is that we take this vector
r new, and we break it into k blocks. And then for every block we scan,
matrix M, and our our old ones, so if you think about how many,
how much input output time we need or operations we need to do,
here is a simple calculation, right, so basically we need to do k scans of,
r and M. So this means that we have k
times size of M, plus size of r, plus we also have to be able to take this
r new and write it back to memory or write blocks of it back to memory. So this means we have the k,
the loading k times the matrix M and we, load, or, vector r k plus one times. The question is, of course,
can we do better? And actually we cou, we can do better. And the trick to do better is to
basically preprocessed matrix M. So the insight or the or, the for
us to do this is the following. Basically the matrix M is much
bigger than the vector R. Right?
So m matrix M is lets twenty or thirty times bigger than vector R. So in some sense we want to avoid reading
matrix M into memory multiple times. So we are happy to scan over r, but
scanning over M takes lots of time. So what we want to do is, we want to
preprocess M in some way such as we don’t need to load it multiple times. And the way to preprocess M, is what is
called a block-stripe of data algorithm. So what we did is that we took the matrix
M, and we preprocessed it into these chunks that correspond
actually to the chunks of r new. And the idea is the following, right? So for every chunk of matrix m, we only
store the destination address that are, that, that have destinations in our block,
in the corresponding block of the, our new vector. Right?
So, for example, what this means, is that now when we are updating
the values of r new in the first bar, all we have to do is scan over
the first part of the matrix set. And then when we want to update
the values are new in the second block, we only need to scan through this part
of the matrix M and then same for the third block and so on. And what this saves us from
is that now we basically, in a single iteration of the bower method. what we need to do is, we need to do,
in some sense, one scan over matrix M. Of course, there is a bit of overhead
because now we have to store the degree of multiple times, but
that is relatively small overhead overall. And actually this is worth it in,
it’s worth making this in practice. Right? So the idea is that we take matrix M and
break it into stripes. And each stripe contains only
the destination nodes, corresponding to the block of, the, the, our new, or
to the block of nodes in our new vector. and, this, as I mentioned, causes some, brings some additional
overhead per stripe. But generally, it’s,
very much worth it to have that. And now, if you think about, what is
the input/output cost per iteration. We basically see that we need to read,
the matrix M, once. Plus there is some extra additional cost. And then we need to, as before, read
the vector r, k times, and write it once. So that’s why we get,
k plus 1, times size of r. So let me summarize what we know so
far and what we have done. We defined page rank, and we are able to compute it on kind
of arbitrarily large graphs. There are also a few problems with page
rank that we are going to address next. First, PageRank as defined so far is a generic measure
of an importance of a page. For example, for every node in the graph,
we only compute one score. And we don’t really say whether
a given page is good or important with respect to a given topic. So what we will, what we will do next as,
as a solution to this, we will talk about what is
called topic-specific PageRank. The second thing that is also. Is that we together with
PageRank there are other ways to think about importances of nodes. And one way how to compute
importances of nodes in a, in a bit different way,
is called hubs-and-authorities. So this is what is also coming. And then the, the last thing we will, we will discuss is that PageRank is
susceptible to what is called link spam. So the idea is that we have
spammers on the web who want to manipulate the structure of the web graph, such that they boost page rank
score of individual target pages. So the question is, how can we
detect spammy web pages on the web? And the solution for
this is called TrustRank. So what we’ll do next is we will go
through each one of these three, applications one by one.

Leave a Reply

Your email address will not be published. Required fields are marked *