In the previous unit, I showed you a way to rescue MaxEnt models of the language abundance P(n) by reference to a hidden variable, epsilon, which we referred to as “programmer time”. We assumed that the system was constrained in two ways: not only to a certain average number of projects per language (that is, we fixed the average popularity of languages), but also to a certain average programmer time devoted to projects in a particular language. And so this distribution, here, looks like this in functional form. And when we integrate out this variable, epsilon, we get something that looks like this.

So we get a different prediction for the language distribution. And we get a prediction that, I argue, looks reasonably good. It certainly looks better than the exponential distribution. I feel honor-bound to tell you
about the controversy that arises when we try to make these models… and in particular, there is a very different mechanistic model that looks quite similar. So this is the Fisher log series. And the argument behind the Fisher log series, to explain this distribution, involves the idea of a “hidden” additional constraint. In the open source question, what I’ve done is describe that additional constraint as “programmer time”, just because it seems like it might be a constraint in the system. OK?
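As an aside, here is a minimal sketch of what “integrating out epsilon” does. It assumes, as in the ecological MaxEnt literature, that the joint distribution has the form P(n, epsilon) proportional to exp(−lambda1·n − lambda2·n·epsilon). That formula is not reproduced in this transcript, so treat it, along with the function names and the multiplier values, as assumptions made for illustration. Under that assumption, integrating over epsilon leaves something proportional to x^n / n, with x = exp(−lambda1), which is the Fisher log series.

```python
import math

def joint_weight(n, eps, lam1, lam2):
    """Unnormalized joint weight exp(-lam1*n - lam2*n*eps) (assumed form)."""
    return math.exp(-lam1 * n - lam2 * n * eps)

def marginal_weight(n, lam1, lam2, eps_max=100.0, steps=100_000):
    """Integrate the joint weight over eps in [0, eps_max] by the midpoint rule."""
    h = eps_max / steps
    return h * sum(joint_weight(n, (k + 0.5) * h, lam1, lam2) for k in range(steps))

def log_series_pmf(n, x):
    """Fisher log series: P(n) = -x**n / (n * ln(1 - x)) for n = 1, 2, ..."""
    return -(x ** n) / (n * math.log(1.0 - x))

lam1, lam2 = 0.1, 0.5  # hypothetical multiplier values, chosen only for the demo
x = math.exp(-lam1)

# If the marginal is proportional to the log series, this ratio is the same for every n.
ratios = [marginal_weight(n, lam1, lam2) / log_series_pmf(n, x) for n in range(1, 6)]
print(ratios)
```

The ratios come out essentially constant across n, which is what “proportional to” means here. The extra factor of 1/n is what separates the log series from the plain exponential prediction.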

That is, the average programmer time for languages is fixed, not just the average number of projects. And so that means that languages can vary in their popularity, but also in their efficiency. In the ecological modelling, languages are species, and this is the abundance of species: the number of particular instances of the species in the wild. And this is metabolic rate: how much energy a particular species consumes. And so in that case, the system is constrained to a certain average species abundance, and a certain average species energy consumption. Here, languages are constrained to a certain abundance, and a certain consumption of programmer energy. That’s the analogy.

So, we can also build a mechanistic model of programming language popularity. And previously, when we studied the taxi-cab
problem and produced a mechanistic model, we were able to find a very simple one that had the same predictions as the MaxEnt model did. Here, by contrast, what we’re going to find is that the mechanistic model is going to produce similar behavior, but the functional form will actually be slightly distinct.

So, here’s the mechanistic model… we imagine that languages all start out with a baseline popularity. Whoever invents the language, for example, has to write at least one project. There is at least one programmer at the beginning of a language’s invention who knows how to program in that language, somewhat by definition. And so there are two ways that that popularity can grow. It can grow, for example, linearly. So, on day 1 there’s 1 programmer. And on day 2, that one programmer is joined by another programmer. And on day 3, those two programmers are joined by a third. And so, over time, what you have is growth that’s linear in time.

But perhaps a more plausible model for how languages accrue popularity is multiplicative. At time 1, there’s 1 programmer, and he has some efficiency of converting other programmers to his cause. So maybe he’s able to double the number of programmers. And he’s able to double the number of
programmers, because his language is particularly good, and perhaps people who like to program in that language happen to be particularly persuasive. And so on the second day, those two programmers each go out and convert one more person, because they are the same as the original programmer in their effectiveness, and the language itself is just as convincing as it was before. So each of those two programmers becomes two, and we go to 4. And by a similar argument, we go to 8, and so this would be the exponential growth model… where the number of programmers as a function of time increases multiplicatively, as opposed to additively.

So, let’s make this model a little more realistic, and in particular, let’s allow
the multiplicative factor, which in this case we set to 2, to vary. And in fact, we’re going to draw this multiplicative factor, alpha, from some distribution. It doesn’t really matter what that distribution is, as long as alpha is always greater than 0, so it’s not possible for all programmers to suddenly disappear, and as long as it’s bounded at some point, so it’s impossible for a language to become infinitely popular after a finite number of steps. So each day we’re going to draw a number, alpha, from this distribution, here. So, after one day, there are alpha… or rather alpha(1) programmers. (This is the draw on the first day.) On the second day there are alpha(2) times alpha(1) programmers, and so on.
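The day-by-day product just described can be sketched as follows. This is a minimal illustration, not the lecture’s own code: the uniform distribution on [0.5, 2.0] is an assumed choice of a positive, bounded distribution for alpha, and the function name `simulate_language` is hypothetical.

```python
import math
import random

def simulate_language(n_days, low=0.5, high=2.0, seed=0):
    """Return (counts, alphas) for one language grown multiplicatively.

    counts[0] is the single founding programmer; counts[t] is the count
    after t days. Each alpha is drawn uniformly from [low, high], an
    assumed distribution that is strictly positive and bounded.
    """
    rng = random.Random(seed)
    alphas = [rng.uniform(low, high) for _ in range(n_days)]
    counts = [1.0]
    for alpha in alphas:
        counts.append(counts[-1] * alpha)  # multiply, rather than add
    return counts, alphas

counts, alphas = simulate_language(n_days=3)
# After three days the count is alpha(3) * alpha(2) * alpha(1),
# up to floating-point rounding.
print(counts[3], alphas[2] * alphas[1] * alphas[0])
```

Because every alpha is strictly positive, the count never reaches zero, and because alpha is bounded above by `high`, the count after `n_days` steps cannot exceed `high ** n_days`.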

On the third day, it’s alpha(3) times alpha(2) times alpha(1). So this is now growth that occurs through a random multiplicative process. It’s similar to growth that would happen through a random additive process, except now, instead of adding a random number of programmers each day, you multiply the total number of programmers each day by some factor alpha drawn from this distribution.

Now, you can always convert this multiplicative process into an additive process by a very simple trick: taking the logarithm. If we count programmer numbers, we’re multiplying, but if we’re working in log space, we’re just adding. Each day we’re adding a random number, log alpha, to the running total. As long as alpha is always strictly greater than zero, these will always be well defined. Now, all of a sudden, it looks like the
additive model in log space. And what we know from the central limit theorem is that if you add together lots of random numbers, the distribution of that sum tends towards a Gaussian distribution, with some particular mean (mu) and some particular variance (sigma squared). Let’s not worry about what mu and sigma are in particular, but rather note that the growth happens in log space. The distribution of these sums over long time scales will end up looking like a Gaussian distribution. The average boost per day to a language looks, in log space, like a Gaussian distribution.

What that means is that the exponential growth model with random multiplicative kicks actually looks like a Gaussian in log space, or what we call a log-normal in actual number space. So, instead of looking at the logarithm of the popularity of the language, just look at the total popularity of the language, and what that means is that it looks like the exponential of minus (log(n) minus some mean) squared over two sigma squared, that is, exp[−(log(n) − mu)² / (2 sigma²)]… and then you just have to be careful to normalize things properly here. So, this is the log-normal distribution. And a mechanistic model where language
growth happens multiplicatively, where a language gains new adherents in proportion to the number of adherents it already has, where a language gains new projects in proportion to the number of projects it already has, with a factor that depends upon the environment (that’s where the multiplicative randomness comes from), gives you this distribution. Alpha is a random number; it’s not a constant. It’s not 2. It’s not that the language always necessarily doubles. But the fact that it grows through a multiplicative random process, as opposed to an additive process, means that you have log-normal growth.

And so now you can say, “OK, let’s imagine that languages grow through this log-normal process. And let’s find the best-fit parameters for mu and sigma.” And if you do that, you find that the mechanistic log-normal model looks pretty good as well. We were impressed by how well
the blue line fit this distribution compared to the red exponential model, the MaxEnt model constraining only N. That was the red model. Here the blue model does well… this is the Fisher log series. Unfortunately, a mechanistic model… and I’ve given you a short account of the mechanistic model, here, where what’s happening is you’re combining lots of small multiplicative random kicks… the mechanistic model also works [?].

I will tell you that this fits better. Both of these models have two parameters, and if you do a statistical analysis, the Fisher log series actually fits better. In particular, it’s able to explain these really high-popularity languages better. These deviations here seem larger than the deviations here, but you have to remember that this is on a log scale. So this gets much closer up here than this does here. So the mechanistic model, at least visually, looks like it’s extremely competitive with the Fisher log series model derived from a MaxEnt argument. Statistically speaking, if you look at
these two, this one, the mechanistic model, is actually slightly dispreferred. But like many people, what you want is some ironclad evidence for one versus the other. And I think the best way to look for that kind of evidence is to figure out what, if anything, this epsilon really is in the real world. If we were able to build a solid theory about what epsilon was, and how we could measure it in the data, then we could see if this, here, this joint distribution, was well reproduced. We could look for evidence, for example, that these two co-vary: that here we have a term that boosts the popularity of a language if it becomes more efficient. So, if this goes down, this can get higher, and the language can still have the same probability of being found with those properties. And of course, the problem is that we don’t know how to measure this, sort of, mysterious programmer time… programmer efficiency.

The ecologists have a much better time with this, because the ecologists know what their epsilon is. They know that their epsilon is metabolic energy intake. So this is “how much a particular instance of this species consumes in energy over the course of a day, or over the course of its lifetime.” And they’re able to measure that, and in fact, they’re able to measure this joint distribution. When we come to study the open source ecosystem, so far we don’t really have a way to measure this, and so we’re unable to measure the joint. And so now we’re left with one model that’s mechanistic, right, the popularity accrual model, and, over here, this model that talks about there being two constraints on the system: average number and average programmer time.