So this is really the implementation of how we would do this if everything fits in memory, meaning if we can store the graph G in memory, and if we can store the parameter beta and the rank vector r in memory. Okay. So now of course what I have to tell you is, I said that we can really take advantage of matrix M being sparse. What this means is that we want to encode the sparse matrix M using only its nonzero entries, and we don't want to store the zero entries on disk. So the way we do this is the following. The idea is, basically, for every source node we store its degree, and then just the nodes that this node points to. So in this sense, the amount of storage we need is proportional to the number of edges in our graph, and not to the number of nodes, right? Which means we can have a huge number of nodes, but all we really need to store are the edges, the links of our graph. And we don't need to store, or pay any cost for, the links that do not exist in our graph. Since real graphs are sparse, meaning that most of the possible links in the network don't exist, this way we save a lot of space, right?
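As an illustration, this degree-plus-destinations encoding might be sketched as follows (a minimal sketch in Python; the record layout and names are my own choices, not anything prescribed in the lecture):

```python
# Sparse encoding of the link matrix M: for each source node we keep
# its out-degree and the list of nodes it points to. Zero entries of
# M are never materialized, so storage is proportional to the number
# of edges, not to the number of possible node pairs.
def encode_links(edges):
    """edges: iterable of (source, destination) pairs."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    # One record per source node: (node, degree, destinations)
    return [(node, len(dsts), dsts) for node, dsts in out.items()]

records = encode_links([(0, 1), (0, 3), (1, 2), (3, 0)])
print(records[0])  # (0, 2, [1, 3]): node 0 has degree 2, points to 1 and 3
```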

So for example, if we are working with our one-billion-node graph, and imagine that we have 10 billion edges, then this means we will need around 40 gigabytes to store the links of the graph. And 40 gigabytes is very little, in a sense.
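As a sanity check on that number, assuming 4-byte integer node IDs (my assumption; the lecture only gives the total):

```python
edges = 10 * 10**9          # 10 billion edges
bytes_per_dest = 4          # assuming each destination is a 4-byte node ID
gigabytes = edges * bytes_per_dest / 10**9
print(gigabytes)  # 40.0 -- about 40 GB for the destination lists
```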

So far, we assumed that everything fits into memory. But what if matrix M, if our graph, is so big that it doesn't fit into the memory of a single machine? Actually, it turns out that even with the matrix not fitting into memory, we can still implement PageRank quite efficiently. So let's first make the following assumption: let's assume that what we can fit in memory is the vector r. So M is too big to fit in memory, but r is small enough. Then actually everything is very easy and very simple. Right.

So the idea is that we will have r_old stored on the disk, and we will have the matrix M stored on the disk. And then the way we will proceed is basically to do one step of power iteration, one step of our PageRank method. What we will do is, we will read from the disk a given node and its out-links, we will have the vector r_new in memory, and we can update its components. Right, so the idea basically is that we keep the new vector r in memory, and we keep scanning through the matrix M, row by row. We load in a given node and all the nodes it points to, we load its corresponding PageRank score, do the multiplication, and basically spread this PageRank score of our node i into the vector r_new. And basically what this means is that to do one iteration of this, we need to read matrix M one time from disk. Right?
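A sketch of that single update pass in Python (plain lists stand in for the disk-resident data; the value of beta, the record layout, and the teleport handling are my assumptions, and dead ends are ignored for simplicity):

```python
# One power-iteration step: r_new is held in memory and initialized to
# the teleport term (1 - beta) / N, while M -- stored as
# (node, degree, destinations) records -- and r_old are scanned once.
def pagerank_step(records, r_old, n, beta=0.8):
    r_new = [(1 - beta) / n] * n
    for src, degree, dsts in records:   # one sequential scan over M
        share = beta * r_old[src] / degree
        for dst in dsts:                # spread node src's score
            r_new[dst] += share
    return r_new

records = [(0, 2, [1, 2]), (1, 1, [2]), (2, 1, [0])]
r = pagerank_step(records, [1/3, 1/3, 1/3], n=3)
# r sums to 1; repeating the step converges to the PageRank scores
```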

So the way we can think about this is the following, right? At every step, we need to scan the matrix M into memory, and we have to scan the rank vector r_old into memory. And once we have done this, we also have to write our r_new back to the disk, so that the next iteration of our PageRank algorithm can start. So, if you think about the amount of input/output operations we have to do: the amount of input/output is 2 times the size of r, plus the size of M. The size of M is the number of edges, and 2 times the size of r is twice the number of nodes, because we read r_old once and write r_new once. Right, so in some sense, in every iteration of our algorithm we need to scan once through the matrix M.

Now, the question is, can we scale this up even further, right? Could we take this algorithm and scale it even further, in the sense that, what if we have so many nodes that not even our vector r fits into memory? Can we still do something? So now we will look at how to scale up even further. The idea here is that we will take our vector r and split it into blocks. I will first explain the basic idea, and then we will refine this basic idea into something that really works well in practice.

So here's how we can think of the problem, right? Before, we assumed that our vector r_new fits in memory. Now we are assuming that only part of r_new fits into memory. So what this means is what we can now compute in one iteration of PageRank, of the power method. What we do is, we load a part of our r_new into memory, scan through the matrix M, scan through the rank vector r_old, and compute the correct PageRank scores for the given part of the vector r_new. And then we would load in the next part of vector r_new, scan again through the matrix, scan again through r_old, and update the values there. And then we would load the next block, again scan through M, scan through r_old, and so on. And this way, we would basically be able to compute all the values of r_new in some number of passes over M and over the vector r.
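A sketch of this block-at-a-time update (again with in-memory lists standing in for the disk-resident M, r_old, and r_new; the block layout is my own choice):

```python
# Compute r_new one block at a time: for each of the k blocks we rescan
# all of M (and r_old), keeping only contributions whose destination
# falls inside the current block.
def pagerank_step_blocked(records, r_old, n, k, beta=0.8):
    block_size = (n + k - 1) // k
    r_new = [0.0] * n                       # stands in for the on-disk result
    for b in range(k):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        block = [(1 - beta) / n] * (hi - lo)
        for src, degree, dsts in records:   # full scan of M, once per block
            share = beta * r_old[src] / degree
            for dst in dsts:
                if lo <= dst < hi:          # only this block's entries
                    block[dst - lo] += share
        r_new[lo:hi] = block                # "write the block back to disk"
    return r_new
```

With k = 1 this reduces to the single-pass version; the point of the sketch is that M is rescanned once per block, which is exactly the cost analyzed next.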

However, let's think about how well this works. Right? The way to think about this is that it is very similar to what in databases we call a nested-loop join. So the idea is that we take this vector r_new and break it into k blocks, and then for every block we scan matrix M and our r_old. So if you think about how much input/output we need to do, here is a simple calculation: basically, we need to do k scans of r and M. So this means that we have k times (size of M plus size of r), plus we also have to take r_new and write its blocks back to disk. So this means we load the matrix M k times, and we load or write the vector r k plus 1 times.

The question is, of course, can we do better? And actually, we can do better. And the trick to do better is to basically preprocess matrix M. The insight for us to do this is the following: basically, the matrix M is much bigger than the vector r. Right? Matrix M is, let's say, twenty or thirty times bigger than vector r. So in some sense, we want to avoid reading matrix M into memory multiple times. We are happy to scan over r, but scanning over M takes lots of time. So what we want to do is preprocess M in some way such that we don't need to load it multiple times. And the way to preprocess M is what is

called the block-stripe update algorithm. So what we do is take the matrix M and preprocess it into chunks that correspond to the chunks of r_new. And the idea is the following, right? For every chunk of matrix M, we only store the destinations that fall in the corresponding block of the r_new vector. Right?

So, for example, what this means is that now, when we are updating the values of r_new in the first block, all we have to do is scan over the first stripe of the matrix M. And then when we want to update the values of r_new in the second block, we only need to scan through the second stripe of the matrix, and the same for the third block, and so on. And what this saves us is that now, in a single iteration of the power method, what we need to do is, in some sense, one scan over matrix M. Of course, there is a bit of overhead, because now we have to store a node's degree multiple times, but that is a relatively small overhead overall. And it's actually worth doing this in practice. Right?
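A sketch of that preprocessing step (names and layout are mine; stripe b keeps, for each source node, only the destinations that land in block b of r_new, while the node's full out-degree is duplicated into every stripe, which is the small overhead just mentioned):

```python
# Split M into k stripes. Stripe b contains, per source node, only the
# destinations falling in block b, but it repeats the node's FULL
# out-degree so each score can still be divided correctly.
def make_stripes(records, n, k):
    block_size = (n + k - 1) // k
    stripes = [[] for _ in range(k)]
    for src, degree, dsts in records:
        for b in range(k):
            lo, hi = b * block_size, min((b + 1) * block_size, n)
            in_block = [d for d in dsts if lo <= d < hi]
            if in_block:
                stripes[b].append((src, degree, in_block))
    return stripes

stripes = make_stripes([(0, 2, [1, 2]), (1, 1, [2]), (2, 1, [0])], n=3, k=2)
# stripes[0] holds destinations 0..1, stripes[1] holds destination 2,
# so updating one block of r_new touches only one stripe of M.
```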

So the idea is that we take matrix M and break it into stripes, and each stripe contains only the destination nodes corresponding to one block of the r_new vector. And this, as I mentioned, brings some additional overhead per stripe, but generally it's very much worth it. And now, if you think about the input/output cost per iteration: we basically see that we need to read the matrix M once, plus there is some small extra cost for the duplicated degrees, and then we need to, as before, read the vector r k times and write it once. So that's why we get the size of M, plus a little overhead, plus k plus 1 times the size of r.

So let me summarize what we know so far and what we have done. We defined PageRank, and we are able to compute it on arbitrarily large graphs. There are also a few problems with PageRank that we are going to address next. First, PageRank as defined so far is a generic measure of the importance of a page: for every node in the graph, we compute only one score, and we don't really say whether a given page is good or important with respect to a given topic. So what we will do next, as a solution to this, is talk about what is called topic-specific PageRank. The second thing is that, alongside PageRank, there are other ways to think about the importance of nodes. One way to compute node importances in a bit different way is called hubs-and-authorities, so this is also coming. And then the last thing we will discuss is that PageRank is susceptible to what is called link spam. The idea is that there are spammers on the web who want to manipulate the structure of the web graph, such that they boost the PageRank score of individual target pages. So the question is, how can we detect spammy web pages on the web? And the solution for this is called TrustRank. So what we'll do next is go through each one of these three applications one by one.