In Andrew Gelman style, I am going to start turning some emails into blog posts. The main motivation is that some of this material may be useful to a broader audience than just the recipient of the email. For those who are considering emailing me, I will ask you whether I can post to the blog or not (unless you are my mother, in which case I won't ask). The additional benefit is that if others have words of wisdom they wish to add, they can do so here (at least for a limited time). Without further ado, from Eric:
Data: I'm trying to work out how to analyze data where some variables are measured daily and others are a final result at the end of a growing season. When you are doing the epidemiology studies, it seems the cases are counted up over time, and then interventions are specific events. Are these specific events incorporated into the models themselves (becoming an independent variable?) or just shown on the charts to visualize the effect?

How often do you run into overfitting issues? That is one of our concerns with our data set: we have so many variables. Also, the data set doesn't seem to neatly fit into any of the normal tools (statistics or data mining) that are out there. But probably that is just because I'm not up to speed on what the scientific/statistical world normally does.

We have at least one dependent variable (yield) which is measured as a continuous variable, and we have dozens of independent variables, some of which are continuous, some are binary, and some are labels with many options (soil type, last year's crop type, etc.). From what I've read so far, some techniques want all binary/label-type independent variables, and other techniques use only continuous independent variables. Do I need to 'artificially' break a continuous variable into a binary label (i.e., instead of using rainfall amounts, bin certain time periods (days/months/growing season?) as low rainfall and high rainfall)? Or are there some methods out there I should be reading up on?

When I read your research interest, "I build models to represent the fluctuations of protein levels within cells. These models are non-linear, multivariate processes that include both intrinsic and extrinsic noise as well as measurement error," I substituted in my application and thought maybe the same tools should be used. (We have a multivariate situation, and things like the meteorological data could have a lot of measurement error.) But now I'm wondering if there is a big distinction in the term 'model'. Does your 'model' include the connotation that it is a continuous system, one that can (possibly) reach a steady state but not a final destination? So in our situation, where we will reach the end of a growing season, does that call for entirely different methods from what you are doing? Are Markov chains a subset of regression analysis, or an entirely different thing?
My response: For me, a model is a statistical model, one that describes a generative process for how the data actually observed arose. This is in contrast to most mathematical models, which could never have generated the data yet may describe the data very well. It is unclear to me what computer scientists think a model is. In machine learning, unsupervised and supervised learning try to partition the data into groups; AFAIK, this is typically done without a statistical model. In statistics, we do the same thing (but we call it clustering and classification) and it is always based on a statistical model. The model could be continuous or discrete, it could have a steady state or not, etc.

Markov chains are stochastic (random) processes. They are chains because the observations from the process have an order. They are Markov because the evolution depends only on the most recent observation, not on the whole history of observations. Markov chains can be used as models for data, although my main use for Markov chains is in estimating parameters through a technique called Markov chain Monte Carlo.

Regression analysis attempts to identify the relationship between one dependent variable (the outcome) and several independent variables (covariates or explanatory variables). If you say regression or linear regression, the outcome is typically thought of as continuous, while Poisson regression would be for count data and logistic regression for binary data. There are many other possibilities for continuous, count, and binary regressions, but these are the most common. Regression analysis does provide a model for the data if you want it to. The term multivariate refers specifically to whether the outcome is a scalar or a vector; the latter is multivariate data. It has no relevance to how many covariates you have, which may be numerous.

In the situation you described, where you have one continuous outcome (yield) and many possible covariates that may be continuous, binary, or label (nominal or factor), the standard approach would be linear regression. Regression will accommodate all of these covariates with ease, i.e. no artificial breaking of continuous covariates. But the 'linear' in linear regression means that you are making a linear assumption about the relationship between yield and each covariate. If this assumption is poor, it might be advantageous to break the continuous covariate into categories, preferably 3 or more. You are also allowed to take a function of a covariate, e.g. the logarithm, which would assume a linear relationship between the outcome and the logarithm of that covariate. In addition to the covariates you have and functions of those covariates, you can have additional covariates called interactions, which are simply products of covariates.

Given all these covariate possibilities, overfitting is certainly an issue. If you have n data points and p covariates (where p is the sum of all covariates, functions of covariates, and interactions), then if n>>p (n is much larger than p) you may not need to worry about overfitting. If n and p are of similar order, then overfitting will be a serious concern. If n<p, you have the so-called 'large p, small n' problem, for which there are also methods. There are many ways of dealing with overfitting. The simplest is to use a model selection criterion such as AIC or BIC, which weigh improvement in fit against model complexity.
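To make this concrete, here is a minimal sketch in Python using statsmodels. The data are simulated and the column names (yield_bu, rainfall, irrigated, soil_type) are hypothetical stand-ins for Eric's variables; the point is that one regression formula can mix continuous, binary, and factor covariates, apply a log transform, add an interaction, and compare models by AIC/BIC.

```python
# A sketch of linear regression with mixed covariate types and AIC/BIC
# comparison. All data and column names here are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
d = pd.DataFrame({
    "rainfall": rng.gamma(shape=4.0, scale=50.0, size=n),      # continuous
    "irrigated": rng.integers(0, 2, size=n),                   # binary
    "soil_type": rng.choice(["clay", "loam", "sand"], size=n), # label/factor
})
# Synthetic yield: linear in log(rainfall), plus a soil effect and noise.
soil_effect = d["soil_type"].map({"clay": -5.0, "loam": 5.0, "sand": 0.0})
d["yield_bu"] = (40 + 10 * np.log(d["rainfall"]) + 8 * d["irrigated"]
                 + soil_effect + rng.normal(scale=5.0, size=n))

# C() expands a label covariate into indicator variables automatically,
# np.log() transforms a continuous covariate, and ':' adds an interaction,
# so no manual binning of continuous covariates is needed.
m1 = smf.ols("yield_bu ~ np.log(rainfall) + irrigated + C(soil_type)",
             data=d).fit()
m2 = smf.ols("yield_bu ~ np.log(rainfall) + irrigated + C(soil_type)"
             " + np.log(rainfall):irrigated", data=d).fit()

# Lower AIC/BIC is better; BIC penalizes the extra interaction term more.
for name, m in [("main effects", m1), ("with interaction", m2)]:
    print(f"{name}: AIC={m.aic:.1f}  BIC={m.bic:.1f}")
```

Each fitted model reports AIC and BIC directly, so comparing a handful of candidate models is just a matter of printing them side by side.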
With a model that is complex enough we can fit the data perfectly, but the model will be of limited use. The difference between AIC and BIC is how much they penalize complexity, with BIC penalizing more and therefore preferring simpler models. Combine this with an automatic stepwise selection procedure to get some sense for which covariates are important. There are many other statistical methods that deal with non-linearities, interactions, and overfitting, e.g. multivariate adaptive regression splines (MARS) and Bayesian additive regression trees (BART). I'm not familiar enough with these (or other techniques) to know whether they require continuous covariates, although I can't imagine that they would.

For your particular data analysis, the art will be in creating important covariates. For predicting yield, I would suspect the relevant covariates would be things like fertilizer, water, insecticide, etc. Perhaps the amount of water(ing) on the field is measured daily. Rather than taking these daily measurements as the covariates, you should feel free to create your own. For example, I can imagine that the average water(ing) and the standard deviation of the water(ing) would be two relevant covariates. The former gives a sense for the total amount of water the field had, while the latter provides information about how that water was distributed, e.g. a little bit every day or a lot every 2 weeks. Farmers may be able to give you more insight into the (perceived) importance of water at different times during the growing season, which would suggest alternative/additional covariates.
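As a sketch of that covariate-creation step, here is how daily measurements could be collapsed to one row per field with pandas. Again the column names (field_id, water_mm) are hypothetical:

```python
# A sketch of turning daily measurements into per-field covariates,
# e.g. the mean and standard deviation of daily watering.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
daily = pd.DataFrame({
    "field_id": np.repeat(["A", "B", "C"], 120),            # 120 days per field
    "water_mm": rng.gamma(shape=2.0, scale=3.0, size=360),  # daily watering
})

# The mean captures roughly how much water each field got; the standard
# deviation captures how evenly it was spread over the season.
covariates = daily.groupby("field_id")["water_mm"].agg(
    water_mean="mean", water_sd="std")
print(covariates)
```

The resulting one-row-per-field table can then be merged with the end-of-season yields and fed into a regression like the one sketched above.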


Published: 02 November 2011
