To correlate or not correlate…that is the question

In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.

Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).

Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other).
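
To make that last point concrete, here is a minimal Python sketch. It is not taken from any of the sources quoted here; the helper name `pearson_r` and the toy data are made up for illustration. It computes Pearson's r from its textbook definition and applies it to a straight line, a cubic and a symmetric parabola.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (illustrative only): eleven evenly spaced points from -5 to 5.
xs = list(range(-5, 6))
linear = [2 * x + 1 for x in xs]    # a straight line
cubic = [x ** 3 for x in xs]        # nonlinear, but always increasing
parabola = [x ** 2 for x in xs]     # y is fully determined by x, yet symmetric

print(pearson_r(xs, linear))    # 1.0
print(pearson_r(xs, cubic))     # roughly 0.92
print(pearson_r(xs, parabola))  # 0.0
```

The cubic shows that a nonlinear function can still produce a high r, while the parabola shows that r can be zero even when one variable is completely determined by the other.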

An r value of 1 is a perfect positive correlation between the two events: as one gets bigger, so does the other. An r of -1 means that as one gets bigger the other gets smaller, and r=0 means there is no linear correlation. So often we see correlations of r=0.4-0.6 between this outcome and that outcome. To me that means there is so much information missing in such a weak correlation that it is difficult to make any biological sense of the relationship between the two events.
Basnyat P, Hagman S, Kolasa M, Koivisto K, Verkkoniemi-Ahola A, Airas L, Elovaara I. Association between soluble L-selectin and anti-JCV antibodies in natalizumab-treated relapsing-remitting MS patients. Mult Scler Relat Disord. 2015;4:334-8.

OBJECTIVE: In relapsing-remitting MS (RRMS) patients treated with natalizumab, the low level of L-selectin-expressing CD4+ T cells has been associated with the risk of progressive multifocal leukoencephalopathy (PML). In this study, our aim was to correlate the levels of soluble L-selectin and the anti-JCV antibody index in the sera of RRMS patients treated with natalizumab.
METHODS: This study included 99 subjects, including 44 RRMS patients treated with natalizumab, 30 with interferon beta (IFN-β) and 25 healthy controls. The levels of soluble L-selectin (sL-selectin) in sera were measured by ELISA, and the anti-JC Virus (JCV) antibody index was determined by the second-generation ELISA (STRATIFY JCV(TM) DxSelect(TM)) assay.
RESULTS: A significant correlation was found between the levels of sL-selectin and anti-JCV antibody indices in sera in the natalizumab-treated patients (r=0.402; p=0.007; n=44), but not in those treated with IFN-β. This correlation became even stronger in JCV seropositive patients treated with natalizumab for longer than 18 months (r=0.529; p=0.043; n=15).
CONCLUSION: The results support the hypothesis of sL-selectin being connected to the anti-JCV antibody index values and possibly cellular L-selectin. Measurement of serum sL-selectin should be evaluated further as a potential biomarker for predicting the risk of developing PML.
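
As a rough check on how firmly those two r values are pinned down by samples of 44 and 15 people, here is a sketch of an approximate 95% confidence interval using the Fisher z-transformation. This is not part of the paper's analysis; the function name `fisher_ci` is made up for illustration, and the method assumes the data are roughly bivariate normal, which the abstract does not tell us.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson r via Fisher's z.

    Assumes roughly bivariate-normal data and n > 3.
    """
    z = math.atanh(r)                      # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# The two correlations reported in the abstract above.
for label, r, n in [("all natalizumab-treated", 0.402, 44),
                    ("JCV+ and >18 months", 0.529, 15)]:
    lo, hi = fisher_ci(r, n)
    print(f"{label}: r={r}, n={n}, approx 95% CI {lo:.2f} to {hi:.2f}")
```

Under those assumptions the intervals come out wide: roughly 0.12 to 0.62 for the larger group and roughly 0.02 to 0.82 for the group of 15, so the "even stronger" correlation is compatible with almost anything from no relationship to a tight one.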

L-selectin, also known as CD62L, is a cell adhesion molecule found on lymphocytes and the preimplantation embryo. It belongs to the selectin family of proteins, which recognize sialylated carbohydrate groups. L-selectin acts as a “homing receptor” for lymphocytes to enter secondary lymphoid tissues via high endothelial venules. Ligands present on endothelial cells will bind to lymphocytes expressing L-selectin, slowing lymphocyte trafficking through the blood, and facilitating entry into a secondary lymphoid organ at that point. The receptor is commonly found on the cell surfaces of T cells. Naive T-lymphocytes, which have not yet encountered their specific antigen, need to enter secondary lymph nodes to encounter their antigen. Central memory T-lymphocytes, which have encountered antigen, express L-selectin to localize in secondary lymphoid organs. Here they reside ready to proliferate upon re-encountering antigen. Effector memory T-lymphocytes do not express L-selectin, as they circulate in the periphery and have immediate effector functions upon encountering antigen.

L-selectin can be shed from the cell surface, and in this study the soluble form correlates with the anti-JC virus antibody index, which is an indication of whether you have had contact with JC virus. But the r is only 0.4.

So if you had a certain level of L-selectin, would it tell us what your anti-JC titre is? I would say the answer is certainly no. The predictive value is low, so I would guess it is not a good biomarker.
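
One way to put a number on this is to ask how much an r of that size narrows down a prediction. The sketch below is illustrative only: the function names are made up, and it assumes a simple straight-line (least-squares) prediction of one measure from the other. In that setting r squared is the share of variance accounted for, and the leftover scatter around the line is sqrt(1 - r²) times the original spread.

```python
import math

def variance_explained(r):
    """Share of the variance in one measure accounted for by the other."""
    return r ** 2

def residual_sd_fraction(r):
    """Scatter left around a straight-line prediction, as a fraction of
    the original standard deviation (simple linear regression)."""
    return math.sqrt(1.0 - r ** 2)

# 0.402 and 0.529 are the values from the abstract; 0.9 is for comparison.
for r in (0.402, 0.529, 0.9):
    print(f"r={r}: variance explained {variance_explained(r):.0%}, "
          f"remaining scatter {residual_sd_fraction(r):.0%} of original")
```

For r=0.402 this gives about 16% of the variance explained, with the scatter around the prediction still about 92% of what it was before you knew the L-selectin level. Even at r=0.9 the remaining scatter is still about 44%.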

This level of correlation has, in my mind, been the problem with many MRI outcomes, which have no clear pathological correlate.
So on a population level you may get a hint of a relationship, but for any individual it has no really meaningful outcome unless they are at one of the extremes.
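
The population-versus-individual point can be sketched with a back-of-envelope calculation. Assume, purely for illustration, that the two measures follow a bivariate normal distribution (nothing in the abstract confirms this). Then the chance that someone who is above the median on one measure is also above the median on the other is 0.5 + asin(r)/π. The function name below is made up.

```python
import math

def same_side_probability(r):
    """Chance that a person above the median on one measure is also above
    the median on the other, assuming a bivariate normal distribution
    with correlation r."""
    return 0.5 + math.asin(r) / math.pi

for r in (0.0, 0.4, 0.6, 0.9, 1.0):
    print(f"r={r}: {same_side_probability(r):.0%}")
```

With r=0 this is a coin flip at 50%, and at r=0.4 it only rises to about 63%. That is detectable across a population, but not much to go on for the person sitting in front of you.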

So when you see a paper saying this aspect correlates with that aspect, look at the r value to see how much confidence you can put in the conclusions.

  • "look at the r value to see how much confidence you can put in the conclusions."

    I think this is poor advice. The correlation coefficient is pretty worthless as a summary statistic, a high r value is neither necessary nor sufficient for a strong relationship between the dependent and independent variables. There's an especially nice discussion of this in "Advanced Data Analysis from an Elementary Point of View", which is available for free on Cosma Shalizi's website.

    • I like to do the "smack you in the eye test" to see if the data really looks different, in many cases the test is failed.

      If it is so worthless, why would people report it? And you have lost the readership with dependent and independent variables. Few people understand statistics, and statisticians can sit on both sides of the fence.

    • Enjoy the Cosma website…but not the easiest of reads.

      The value of a correlation is whether having one outcome can tell you about another at the individual level. If it does not do that, it is academic.

    • I think the "smack you in the eye test" is a great idea, much better than using the correlation coefficient. The problem with summary statistics like r is that they give a false sense of certainty and precision, almost never mean what people think they do, and require conditions on the data generating process that are almost never met. The human visual system is much better than these measures at identifying robust patterns in data.

    • Yes, 95% of animal studies in the top-flight journals use, in my opinion, the wrong stats, because the assumptions of the statistics do not hold. Maybe this is why we get so many false positives, and I can prove this for a Nature paper.

      We have to ask what is biologically significant, not statistically significant. This helps you to log and remember: are you going to change practice? When you see poor correlations, which we see over and over again in clinical MS studies, will they encourage you to change practice, or should you be demanding this or that test? The probable answer will be no; otherwise you will believe that wearing high heels causes psychosis (see the cats post) :-)

    • "I think this is poor advice."

      Really, you're happy with the deluge of papers with correlations that mean very little? Hope a raw nerve has not been touched :-).

      Maybe we can have a meta-analysis of r=0.3 studies that were life-impacting versus r=0.3 studies that were not repeated or had no real utility, compared to r=1 studies :-)

      Seriously, if people think there are good educational sites that help you understand statistics, send a link.

    • I also prefer studies with tight fit to the data, so I'm not objecting to that. I'm objecting to the particular formal methods that are commonly used to measure tightness of fit, such as the correlation coefficient. There are many other formal methods that have been developed, but unfortunately it's not "one size fits all", you need to calibrate your use of these methods to the particular data set that you want to analyze.

      As far as studies with low correlation but high impact, what about GWAS? I think the MS GWAS are very interesting, even though the identified genes only account for a small portion of disease heritability. The story is similar for other diseases, e.g. autism and schizophrenia.

    • GWAS? Do you mean GWHAT? So far it has not found anything that has made an impact on how you deal with MS.

By MouseDoctor


