My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
PowerLaw  
Python implementation of a power-law distribution fitter.
powerlaw
Updated Feb 3, 2012 by keflavich

Power-law Distribution Fitting

Aaron Clauset et al. address the issue of fitting power-laws to distributions on this website and in their paper Power-law distributions in empirical data. I have created a python implementation of their code because I didn't have matlab or R and wanted to do some power-law fitting.

Power-laws are very commonly used in astronomy and are typically used to describe the initial mass function (IMF), the core mass function (CMF), and often luminosity distributions. Most distributions in astronomy tend to be apparent power-laws because the source counts are too few or too narrow to distinguish powerlaws from log-normal and other distributions. But, to this end, I've included the testing mechanism to test for consistency with a power law as described in the above paper.

Only the continuous case is implemented in this case; my research interests do not (yet) require the discrete distribution.

The python internal documentation is complete. A brief description of relevant functions is included here for convenience:

plfit is implemented as a class. This means that you import plfit, and declare an instance of the plfit class: import plfit X = rand(1000) myplfit = plfit.plfit(X) The results of the fit are printed to the screen (if desired) and are stored as part of the object.

alpha_ and kstest_ are functions used internally to determine the ks-statistic and alpha values as a function of xmin.

There are 3 predefined plotting functions:

  • alphavsks plots alpha on the y-axis vs. the ks statistic value on the x-axis with the 'best-fit' alpha value plotted with error bars. These plots are a useful way to determine if other values of xmin are similarly good fits.
  • plotcdf plots the cumulative distribution function along with the best-fit power law
  • plotpdf plots a histogram of the PDF with the best fit power law. It defaults to log binning (i.e. a linear power-law fit) but can do dN/dS and linear binning as well.

test_pl uses the fitted power-law as the starting point for a monte-carlo test of whether the powerlaw is an acceptable fit. It returns a "p-value" that should be >0.1 if a power-law fit is to be considered (though a high p-value does not ensure that the distribution function is a power law!).

plexp_inv creates a cutoff power-law distribution with an exponential tail-off. It is useful for tests. pl_inv creates a pure cutoff power-law distribution test_fitter uses the previous two functions to test the fitter's ability to return the correct xmin and alpha values for large numbers of iterations

The powerlaw fitter is very effective at returning the correct value of alpha but not as good at returning the correct value of xmin.

There are 3 implementations of the code internals. fplfit.f is a fortran function, cplfit.pyx is a cython function, and plfit.py is the wrapper and includes a python-only implementation that requires numpy. FORTRAN is fastest, follow closely by cython. Python is ~3x slower.

As of November 21, 2011, there is a pure python (i.e., no numpy) implementation at plfit_py.py. It's slower and hobbled, but it works, and perhaps will run fast with pypy.

To install the cython function, run: python setup.py build_ext --inplace

To install the fortran function, run: f2py -c fplfit.f -m fplfit

For usage examples, view speedcompare_plfit.py, clauset2009_tests.py, plfit_tests.py.

A very simple example:

import plfit
from numpy.random import rand,seed

# generate a power law using the "inverse" power-law generator code
X=plfit.plexp_inv(rand(1000),1,2.5)

# use the numpy version to fit (usefortran=False is only needed if you installed the fortran version)
myplfit=plfit.plfit(X,usefortran=False)
# output should look something like this:
# PYTHON plfit executed in 0.201362 seconds
# xmin: 0.621393 n(>xmin): 263 alpha: 2.39465 +/- 0.0859979   Log-Likelihood: -238.959   ks: 0.0278864 p(ks): 0.986695

# generate some plots
from pylab import *
figure(1)
myplfit.plotpdf()

figure(2)
myplfit.plotcdf()

If you use this code, please cite Clauset et al 2009 and consider posting a comment below.

Direction citations to the source are welcome! The python translation has been cited in the following works (and perhaps others?):

Comment by project member keflavich, Dec 7, 2009

I believe you can also install it by running setup.py now.

Comment by project member keflavich, Feb 28, 2010

There are two known bugs. If compiling the fortran or c versions fails, you may get a bus error. In this case, follow the advice given here: http://code.google.com/p/agpy/issues/detail?id=1

Comment by project member keflavich, Mar 3, 2010

Update: I can't get setup.py to work; I'll keep trying but I could use some help. Cython documentation just doesn't help me that much.

The default code now works as pure python and doesn't try to include the fortran or cython versions. However, they are faster, so if you want to use them, get in touch with me and I'll try to help.

Comment by Youmna.B...@gmail.com, Apr 14, 2010

Thanks for the implementation, it's very helpful. I'm trying a simple usage example MyPL = plfit(mydata) MyPL.plotpdf(log=True)

and I'm getting the following error: TypeError??: unique() got an unexpected keyword argument 'return_index' Would you have any idea why is that?

Thanks

Comment by Youmna.B...@gmail.com, Apr 14, 2010

Can you please tell me which version of Numpy you've got? Thanks

Comment by Youmna.B...@gmail.com, Apr 14, 2010

Ok I had to use numpy.unique1d() as this is the one that accepts return_index as an argument.

Comment by project member keflavich, Jun 9, 2010

Right. Sorry about the delayed reply - I don't get informed when comments are posted. unique1d is deprecated in numpy 1.4. I'm using 1.4.0rc2.

Comment by kill...@gmail.com, Jul 13, 2010

Can anybody help me how to get it running under win 7 64 bit? I have so many problems.

Comment by project member keflavich, Jul 14, 2010

Yikes, sorry killver - I haven't used windows in years. Is the problem with my code, or just some sort of dependency hell in windows python?

Comment by kill...@gmail.com, Jul 16, 2010

I just isntalled python and all modules with 32bit and now it works! Thanks

Comment by dirkh...@gmail.com, Aug 19, 2010

Hi,

As a double check I used the same data in Matlab (From Clauset's site) and in the python implementation. Somehow I get different results for the alpha and xmin value. The difference are substantial (alpha 2.94 in matlab and 1.97 in python). Any idea where these differences come from?

Comment by project member keflavich, Aug 19, 2010

No. That's a problem. Could you send me the data set and your results? I don't have access to matlab at the moment. It's likely to be a data format issue, but if it's not I need to know about it & debug it.

Comment by keith.br...@gmail.com, Jan 30, 2011

I'd like to use plfit but am having problems assembling an environment in which it will work. In particular, I can't work out where the module "pylab" comes from. It looks as if it might come form SciPy?, is that right? I'm currently chasing my tail installing various versions of numpy and scipy to try and find a combination that works.

Could you please leave a note of the prerequisites for running plfit?

Thanks, Keith

Comment by project member keflavich, Jan 30, 2011

pylab is part of the matplotlib package: http://matplotlib.sourceforge.net/. It is needed for all of the plotting routines, but is also a convenient wrapper of many numpy routines.

Comment by keith.br...@gmail.com, Feb 1, 2011

So, following the adage that free sofware is only free if your time is worthless I gave up on trying to construct an environment myself and grabbed the Enthought distribution. Plfit now works for me.

Can't get any of the plotting functions to work, but the numerical results are really what I was after: very nice!

Comment by sru...@gmail.com, Mar 23, 2011

I think you meant to write

f2py -c fplfit.f -m fplfit

rather than

f2py -f fplfit.f -m fplfit

near the end of the description above?

Comment by project member keflavich, Mar 24, 2011

Good catch sru, thanks. It is stated correctly in setup.py, but was not here.

Comment by jeffalst...@gmail.com, Aug 31, 2011

This is a wonderful resource! Thank you so much for not only writing it, but making it freely available. I do have one question regarding the log-likelihood ratio test with the lognormal distribution:

In plfit.py we have

406 self.lognormal_likelihood = scipy.stats.lognorm.nnlf(fitpars,self.data) 
407 self.power_lognorm_likelihood = 2*(self.likelihood + self.lognormal_likelihood)

I would have expected the log-likelihood ratio to be computed as:

407 self.power_lognorm_likelihood = self.likelihood - self.lognormal_likelihood

I am, however, unfamiliar with the exact behavior of nnlf, as it doesn't have great documentation on Scipy. Does nnlf actually produce something different for what is advertised?

Comment by project member keflavich, Aug 31, 2011

Jeff - Good question. I don't think that factor of 2 belongs - I don't know where it came from. However, from my reading of the definition of the log-likelihood and scipy's computation of the log-likelihood, the sign conventions are opposite - Clauset maximizes the likelihood, Scipy minimizes the negative-log-likelihood. At least, I think that's what's going on. It would be nice to test this out against data where we know what the likelihood ratio should be....

Anyway, check out the latest commit. I've documented that section.

Comment by jeffalst...@gmail.com, Sep 2, 2011

Thanks for the rapid edit! However, I'm confused by the signs: If L maximized, plfit for verbose should say:

The log of the Likelihood (the maximized parameter),  Log-Likelihood: ... 

right?

And if self.lognormal_likelihood is a negative value, then

self._likelihood + self.lognormal_likelihood > 0

if a power law is preferred, and

self._likelihood + self.lognormal_likelihood < 0

if the lognormal distribution has a greater likelihood, right?

Is this an indication that we're misunderstanding something clever you did in the past (and perhaps why the factor of two was in there) or that the bug of flipped signs has percolated into a few different areas?

Cheers!

Comment by jeffalst...@gmail.com, Sep 2, 2011

Note: The interpretation of the sign of the loglikelihood ratio is relevant for the output of self.lognormal. Ex.

myplfit.lognormal()
Lognormal KS D: 0.0231904  p(D): 0.227891   Likelihood Ratio (<0 implies power law better than lognormal): 30675.2

Or am I misunderstanding something?

Comment by project member keflavich, Sep 2, 2011

Good catch. Yes, it is the maximized parameter, not the minimized, according to Clauset et al.

As for the interpretation of the likelihood ratio... I really need to re-read / re-derive it to be sure I understand what's going on. That'll take me a few days to get around to, but I'll comment thoroughly when I do.

Comment by joe.jord...@gmail.com, Nov 21, 2011

Hi Keflavich,

Thanks very much for this useful little library! However, if you could clarify something on the description, it would make things a bit clearer.

Basically, requiring numpy is not the same thing as a pure python implementation - there are many situations where the deploying of pure python scripts is perfectly possible but the compilation of extension modules isn't (e.g. scripting frameworks for redistributable applications, like Titanium Desktop or SL4A (scripting layer for android.))

I don't mean to make this sound less awesome - it is very awesome! - but if you could make it clearer in the description that there is no "python-only implementation" it would be less misleading (took me a little while of googling / peeking at your code to work out I couldn't use it :( )

Thanks very much.

Comment by project member keflavich, Nov 21, 2011

Hi Joe, point well taken. It made me realize that most of the code was easily translated into numpy-free pure python. New version here: http://code.google.com/p/agpy/source/browse/trunk/plfit/plfit_py.py. It's a good deal slower than previous versions, as you'd expect, and I haven't tested it extensively. It is also missing the plotting & statistics capabilities that require matplotlib, numpy, and scipy. However, I hope it fits your needs better... give it a shot and let me know.

Comment by michaely...@gmail.com, Jan 30, 2012

Hi,

First of all, many thanks for the awesome implementation. I'm not a physicist but a social network analyst and this implementation of Clauset et al was extremely helpful for me because power-law distribution is frequent in network degree distribution. I'm sure this implementation will be (or at least should be) widely used for researchers in social network research!

However, I have a question which I hope you would be able to help me with. What if the distribution is a power-law distribution but with an exponential cut-off? How might I be able to estimate cut-off point using this code?

Thanks in advance.

Comment by project member keflavich, Jan 30, 2012

Actually, the code is meant to fit a power-law distribution with a cutoff. The 'xmin' parameter tells the point above which the data are represented by a power-law. Below that, they may follow an exponential or some other distribution.

Comment by spa...@gmail.com, Mar 29, 2012

Hi, thanks for the package. Incredibly useful. I needed to get the likelihood out of plfit() and was stumped that in some cases there was no .likelihook attribute. Debugging revealed that this was in cases where plfit() decided to use the discrete fit, and used discrete_best_alpha() internally, which set self.alpha and self.minx, but not self.likelihood. Would it be possible to assign the best_likelihood variable to self.likelihood? I do see that these are marked for internal use, and it would be nice if there were an "official" way to get these useful measures (same for the ks values) out of plfit().


Sign in to add a comment
Powered by Google Project Hosting