Tuesday, April 09, 2013

Imbalanced predictions from imbalanced data set

This time, it is more like a question to you. Any thoughts or ideas are more than welcome.

You might know about Machine Learning classifiers (including logistic regression).
A classifier builds a prediction model from training data (in most cases, historical data) and then uses the model to predict the response of a system, or its class.

Classifiers are used in a wide range of applications, such as image recognition, bioinformatics, marketing (to find the right customers, who will make a purchase), finance (to estimate the risk of default), and so on.

However, since the models are usually trained to maximize overall accuracy, a problem arises when the data are highly imbalanced, which is easy to find in reality.

What do we mean by 'imbalanced'? In a binary classification problem, it means that most of the responses in the data set are 0 and only a very small number of them are 1.

My current work is about a 'descriptive' Accept/Decline (A/D) decision model.
What kind of acceptance model?

I am working on a simulation model for evaluating organ allocation policies, SAM (Simulated Allocation Method).

In SAM, there is an A/D model. Why?
Surprisingly, more than 50% of the time the first candidate (recipient/patient) says, "No, I am not going to take that organ. I am going to wait for a better one." The offer keeps going to the next candidate, and the next, until one of them says 'Yes' or the OPO (Organ Procurement Organization) gives up.

So, in the simulation, it is important to know who would say 'Yes' and take the organ.
That's why we are working on a descriptive model rather than a prescriptive model.

Imbalanced data: Since the offers stop at the first Yes, most of the responses (Y/N or A/D) in the data set are 'No's. Ex) N, N, N, N, N, Y (Stop)

With data this imbalanced (in one case we have a 19.5:1 ratio), a prediction model trained for overall accuracy tends to ignore the minority label (Y). Our predictor will be very accurate at predicting the majority but very poor at predicting the minority. And even a dumb predictor that predicts everything as No can reach a high accuracy (around 95%).
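The 95% figure can be checked with a quick sketch. The data set below is made up (195 No and 10 Yes, just to match the 19.5:1 ratio), and the "predictor" is the dumb all-No rule, not our actual model.

```python
# Hypothetical labels with the same 19.5:1 ratio as in the post.
labels = ["N"] * 195 + ["Y"] * 10

# The "dumb" predictor: say No to every offer.
predictions = ["N"] * len(labels)

# Overall accuracy: share of offers whose label was predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority label: share of true Yes offers predicted as Yes.
yes_hits = sum(p == "Y" and y == "Y" for p, y in zip(predictions, labels))
yes_recall = yes_hits / labels.count("Y")

print(f"accuracy   = {accuracy:.3f}")    # 0.951
print(f"Yes recall = {yes_recall:.3f}")  # 0.000
```

High accuracy, and not a single Yes found. This is why plain accuracy is a misleading measure here.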

Of course, there are several techniques to correct this imbalance, and measures other than plain % accuracy. Under sampling, over sampling, and different error costs can be used. What all of them do is emphasize the minority. OK, by doing so, we get a prediction model that is somewhat good at predicting both the majority and the minority. You can find a lot of literature dealing with these techniques.
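To make "emphasizing the minority" concrete, here is a minimal sketch of random under sampling. The function name `undersample` and the toy labels are my own assumptions for illustration; this is not the procedure used in SAM, and real projects would normally use a library for it.

```python
import random

def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset of a Y/N label list.

    All minority items are kept; majority items are randomly dropped
    until both classes have the same count.
    """
    rng = random.Random(seed)
    yes = [i for i, y in enumerate(labels) if y == "Y"]
    no = [i for i, y in enumerate(labels) if y == "N"]
    minority, majority = (yes, no) if len(yes) <= len(no) else (no, yes)
    kept = minority + rng.sample(majority, len(minority))
    return sorted(kept)

# Hypothetical data with a 19.5:1 ratio.
labels = ["N"] * 195 + ["Y"] * 10
balanced = [labels[i] for i in undersample(labels)]
print(balanced.count("N"), balanced.count("Y"))  # 10 10
```

A model trained on the balanced subset sees Yes as often as No, which is exactly the emphasis described above.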

Here comes the problem. Since our model was built with a technique that emphasizes the minority, it has a higher tendency to predict the minority label (Y) than it should.
Earlier, I said we have 19.5 No labels per Yes label. But with these techniques we predict about 3 to 4 times more Yes labels; the predicted label ratio becomes about 6:1.

This emphasized prediction model might be fine for some purposes, like marketing. When a firm sends out its catalogs, it may use a prediction model to find customers who will buy the product through the catalog. It doesn't matter much if the firm sends out more catalogs than there are real purchases.

But in some applications, it is important to predict the labels (responses) in the right proportion.
We are developing an evaluation method that is specific to our problem, but it will not work for general cases. (If you are interested in my current work, come to Chicago in June, to the INFORMS Healthcare conference ^^)

And I could not find any literature that tells me what to do with a model that has this emphasized tendency.

What can we do with imbalanced data when the distorted tendency is not good for the problem?

Thursday, April 04, 2013

Stochasticity, Randomness and Uncertainty

The Operations Research area can be divided into two parts, deterministic and stochastic.
The line between these two areas is getting blurry these days.
Stochastic programming is one example that sits in between.

Some people use stochastic, random and uncertain interchangeably.
People outside the OR community may do so, and they may not even know the word 'stochastic'.
But I have seen people make this mistake even within our OR community.

When we solve an optimization problem, we need to know the system and its parameters, such as demand and lead time in a supply chain system, for example.
When we know the exact numbers, or it is OK to set them as fixed numbers, we use them as known parameters, and the problem is deterministic.
However, when we don't know the exact figures, and it is not fine to set them as fixed numbers, what should we do?
Yes, probably, we need stochastic.

Then, a question here. Do we employ 'stochastic' when we lack information, because we don't know what the value will be?
No. Indeed, it is the opposite.

A stochastic model requires much more information about the system than a deterministic one does, assuming we are dealing with the exact same problem.
We need to know the probability distributions of the random factors. And a distribution cannot be expressed by a couple of numbers. (Mean and standard deviation are not good enough unless we know the type of the distribution.)
When we don't know what the variable will be, yes, it is 'uncertain'.
When we have no idea at all about what it will be, it can NOT be 'stochastic'.
Uncertainty includes stochasticity, obviously.
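A small sketch of why mean and standard deviation are not enough: a normal distribution and an exponential distribution can share the same mean (1) and standard deviation (1) and still disagree about how likely a large value is. The two distributions and the threshold of 3 are my own picks for illustration.

```python
import math

mean, std, threshold = 1.0, 1.0, 3.0

# Normal(mean=1, std=1): P(X > t) = 0.5 * erfc((t - mean) / (std * sqrt(2)))
p_normal = 0.5 * math.erfc((threshold - mean) / (std * math.sqrt(2)))

# Exponential with rate 1 also has mean 1 and std 1: P(X > t) = exp(-t)
p_exponential = math.exp(-threshold)

print(f"Normal:      P(X > 3) = {p_normal:.4f}")       # 0.0228
print(f"Exponential: P(X > 3) = {p_exponential:.4f}")  # 0.0498
```

Same two numbers, but the chance of exceeding 3 is more than twice as large under the exponential. If X were a demand, the two models would suggest quite different safety stocks.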

Of course, there are research areas within what we call 'stochastic' OR that work without knowing the exact distributions, such as Bayesian updating and worst-case scenarios. But you still need to know something about the uncertain variables.

I don't have a clear distinction of 'random' from 'uncertain' and 'stochastic' either.
My feeling is that 'random' is closer to stochastic than to uncertain (non-stochastic).
Any ideas on a definition of 'randomness', or any other thoughts, are welcome.

P.S. Some people say a stochastic problem is more difficult than a deterministic one. That is true if the system and the model are exactly the same. But that is exactly why the research problems on the deterministic side are far more difficult than those on the stochastic side, if you look at the models themselves and ignore whether the variables are deterministic or stochastic.

Tuesday, October 23, 2012

What I am, Post Doc.

I am a research associate, which is the next position up from research assistant.
In the professor ranks, going from assistant to associate means getting tenure.
Then, what is the difference between a research associate and a research assistant? A PhD degree.
Additionally, the salary jumps up 2 to 4 times.
(Sorry, professors. You cannot double your salary by getting promoted to associate.)

What am I doing as a research associate (aka Post Doc)? Almost the same as what I did as a research assistant (aka graduate student), though with a slightly different attitude.

Why did I become a PostDoc?
Frankly speaking, it was because I couldn't get a job in academia.
(I haven't made an effort to get a job in industry. I am not sure how tough it would be.)
I need some time to publish my papers.

I am hoping that after two years as a PostDoc, I will be a strong candidate on the academic job market.
That will be not only because my previous work will have been published, but also because the new research experience as a PostDoc will add a lot of value.

"Across all S&E fields and cohorts, 53%–56% of former postdocs said that their postdoc appointment enhanced their career opportunities to a "great extent"; an additional 33%–38% said that their postdoc appointment "somewhat" enhanced their career opportunities." - Science and Engineering Indicators 2010

So there is about a 90% chance that it will enhance my career opportunities. Good for me!

Now, how many doctorate recipients become Post Docs?
I couldn't find specific data for the OR area, but I could get the following information.
As OR is a part of mathematics, engineering and computer science simultaneously, we will focus on Math., Eng. and CS.

[Figure: Percentage of doctorate recipients who become Post Docs]


As you can see, the percentages more than doubled in Math., Eng. and CS from 1982 to 2002. Now a decade has passed since 2002. I guess they have increased even more these days.


Post Doc Salary

Table 3-23
Median salary of U.S. SEH doctorate holders in postdoc positions: 2008
(Median salary, $)

Field of doctorate                                   Academic postdocs   Nonacademic postdocs   Nonpostdocs
All SEH                                              42,000              50,000                 75,000
Biological/agricultural/environmental life sciences  41,000              47,000                 65,000
Computer/information sciences                        46,000              S                      90,000
Mathematical sciences                                52,000              S                      71,000
Physical sciences                                    43,000              57,000                 75,000
Psychology                                           42,000              48,000                 60,000
Social sciences                                      47,000              S                      62,000
Engineering                                          43,000              57,000                 90,000
Health                                               43,000              63,000                 80,000

S = suppressed for reasons of confidentiality and/or reliability
SEH = science, engineering, and health
NOTE: Salaries are rounded to nearest $1,000.
SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Survey of Doctorate Recipients (2008), http://sestat.nsf.gov.
Science and Engineering Indicators 2012
http://www.nsf.gov/statistics/seind12/c3/tt03-23.htm

Compared to the salary of a graduate assistant, it is more than double. However, it is still much less than a non-postdoc's. As I remember, my salary is about as much as the average salary of an engineering undergraduate.
OK for a single person, but still not good for a married one, especially with kids.

More detailed information can be found at the following link:
http://www.nsf.gov/statistics/seind12/c3/c3s3.htm#s7

It seems that as the number of Post Docs increases, the expectation for publications increases, and this inflation of publications pushes more doctorate recipients into Post Doc positions. It looks like a vicious cycle is forming, making the job market tougher on top of the economic recession.

The academic job market seems to have been getting better again last year and this year.
I cross my fingers, hoping it will be even better next year.

Saturday, October 20, 2012

Variance and Standard Deviation

Why do we need Standard deviation even though we already have Variance?
Anyone?

Average (arithmetic mean), variance and standard deviation are the three most basic statistics.

I guess all of my readers are familiar enough with average, variance and standard deviation. This post is more about how to teach them to students.

This is how I taught these concepts when I taught Math. for GMAT preparation.

There are 10 test scores. Let's say the average is 50 and the variance is 16.

Q1: What is the standard deviation (StDev)?
Easy. Almost every student can answer this question.

What if all the test scores are doubled?
Average? Sure, still easy. It will be 100.
Q2: Variance? (or StDev?) Emm....
I am not sure how many can answer this question right away.

A story of a principal who doesn't know anything about statistics
There is a school principal who wants his teachers to teach in such a way that 1) the higher the overall score the better, and 2) the more similar the students' scores are to each other the better. He wants equality in the test scores.

There are two classes, A and B, in his school. The principal asks for the total sum of test scores in order to see the overall performance.
$\sum_i^{N} x_i$
Then the teacher of class A argues that he has fewer students than class B, so the total sum is not fair. The principal agrees. So they decide to divide the total sum by the number of students, so that the result is the score per student.
$\frac{\sum_i^{N} x_i}{N}$
What is it? Average!

Now the principal wants to measure the equality of scores, so he asks the teachers to subtract the average from each score and sum the differences up.
$\sum_i^{N} (x_i-\mu)$
It seems a brilliant idea: if the scores are more scattered, the measure will become larger. Oops, he realizes that more students will lead to a higher number. So he decides to divide the measure by the number of students.
$\frac{\sum_i^{N} (x_i-\mu)}{N}$
Now he is satisfied with his brilliant idea.
A problem comes up when the two teachers report their numbers. They both report zero.
'O-ho, there are positive and negative differences from the average, and they all cancel out!'
'How can I avoid this? Yes, let me square them, so that all the differences become positive.'
He asks the teachers to do so.
$\frac{\sum_i^{N} {(x_i-\mu)}^{2}}{N} $
This is what we call Variance.
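For the curious, here is why both teachers reported zero for the un-squared measure. Since the average is $\mu = \frac{\sum_i^{N} x_i}{N}$, the total of the scores is $N\mu$, and the differences always cancel:
$\sum_i^{N} (x_i-\mu) = \sum_i^{N} x_i - N\mu = N\mu - N\mu = 0$
This holds for any set of scores, so that measure can never tell two classes apart.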

The principal is happy with the measures and sees no need to make another one.
And then the score system changes. The full score goes from 100 to 200. (It could be because they need to aggregate scores of several subjects or of multiple tests.) All the scores are doubled. Now everything is doubled, even the distances from the average. So the principal expects the variance to be doubled. However, it becomes 4 times larger. He doesn't like it, because whenever the full score changes, the variance changes too, but by the square of the factor. So he takes the square root of the variance, so that however the score system changes, the measure changes by the same scale.
$\sqrt{\frac{\sum_i^{N} {(x_i-\mu)}^{2}}{N}} $
This is the birth of Standard Deviation.

This story may not interest you at all. However, it should work quite well for students.

Now, the answer to Q2 above is as easy as the doubled average.
And what if all the scores are tripled? :)
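If you want to check the story numerically, here is a small sketch. The ten scores are made up so that the average is 50 and the variance is 16, as in Q1; `summarize` is just a helper name I chose. It uses the population formulas (divide by N), the same as the principal's.

```python
import statistics

def summarize(scores):
    """Return (mean, population variance, population standard deviation)."""
    return (
        statistics.mean(scores),
        statistics.pvariance(scores),
        statistics.pstdev(scores),
    )

# Ten hypothetical scores: average 50, variance 16, StDev 4.
scores = [46] * 5 + [54] * 5

for factor in (1, 2, 3):
    scaled = [factor * s for s in scores]
    mean, var, std = summarize(scaled)
    print(f"x{factor}: mean={mean:.1f} variance={var:.1f} stdev={std:.1f}")
# x1: mean=50.0 variance=16.0 stdev=4.0
# x2: mean=100.0 variance=64.0 stdev=8.0
# x3: mean=150.0 variance=144.0 stdev=12.0
```

The average and the standard deviation follow the scale of the scores, while the variance follows its square.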

Try this with your students.
They will even be able to memorize the whole formula for standard deviation right after the storytelling. (It might be easy for you to memorize, but not for them.)

I believe that the best way to teach is to show the flow of logic.

Friday, October 19, 2012

Blogger Newbie

Yes, I am starting to blog.

I thought about it once a long time ago. But I decided not to blog, because I thought I might not be able to keep my blog ALIVE. (Secondly, I didn't want to reveal my broken English to the world.)

Last Sunday at INFORMS, Paul Rubin came to me (no, I went to him) and asked me, 'Why don't you start blogging?' He said there are many bloggers who are faculty members or graduate students, but no post doc blogger in O.R. He wanted me to blog about what a post doc's life is like. Are people interested in a Post Doc's life? Maybe, maybe not. It doesn't matter. At least I have something to blog about.

And the next day, while presenting about blogging at a social networking session, Laura McLay mentioned that blogging is like gardening, and that it can be a cactus garden.

Yes, at least I can keep a cactus garden. I decided to start blogging, hoping I can upgrade my garden later. So, I reopened my Blogger account, which I had opened a long time ago.

I am writing my first blog post.

Obviously, it will be mainly about Operations Research, with a strong focus on the stochastic part of it.
And it will be a little bit about a Post Doc's life, as Paul asked.
Additionally, sometimes, it will be about computers, languages or software.

Let's see how my cactus garden grows.