Exploration vs exploitation problem

9 comments, last by alvaro 5 years, 4 months ago

I have the following question and wanted to know if my answer is more or less correct (makes sense):

 

Quote

Suppose a player can choose between five actions in all states of a game. And assume that the player has executed each action a different number of times in state 27, noting how valuable each action is in terms of the utility of the states reached after each action. Explain how the player should choose which action to execute next time s/he reaches state 27. Demonstrate your understanding of the exploration versus exploitation dilemma in your answer [2 marks]

My answer:

Quote

Upon reaching state 27, the player will already know the utility values of each action thus there won't be any need to explore any other actions. Therefore, the player can choose the action (exploitation) with the utility value that will return the highest reward.

Does this answer make sense? If not, what needs to be added or changed?


Is this homework? Job interview?


It's a question from a test paper, and I can't verify whether my answers are correct.

10 hours ago, azhar_r said:

Does this answer make sense? If not, what needs to be added or changed?

Your answer seems logical at first glance, but also almost too straightforward to be worth 2 points. I don't know the material that you're covering at all (like zero), but I don't get a sense of a clear understanding of the exploitation vs. exploration dilemma that was mandated in the question.

Assuming that the question has been worded carefully and that state 27 is an arbitrary thing... I'd say that the player, upon returning to state 27, may or may not have taken the same actions as before. Thus, if the value of their current situation is better than before, they may be inclined to explore, as they are already doing well enough, and there might be better ways to increase value. If the current situation is worse than the previous time, they would most likely take an exploitative action to try and catch up to their previous value, as exploring gives unknown results.

I tend to overanalyze, but to understand the "dilemma", your answer would have to give credence to either choice... otherwise, there is no dilemma. Your answer doesn't address any dilemma whatsoever, so you must describe a situation that favors exploration as well; 1 point for exploitation and 1 point for exploration.

My biggest pet peeve with tests is when a teacher glosses over the influence certain words can have over the direction of a question. Every word in a question must have purpose and many teachers fail their own tests with the quality of the writing in their questions.

I hope I'm not doing your homework for you, azhar_r, but I found this question too interesting to resist.

 

Thank you for your answer. Since it was already stated in the question that all five actions have been carried out before and the utility values are known, I thought the player would only exploit a particular action with the highest utility value.

But I understand your point about coming back to state 27 via other actions with "better" utility values which would then affect the actions being taken from state 27.

Thank you for your response. And no, this is directly from a test for which, unfortunately, I don't have the memo to cross-check my answers.

No, I don't think you understand the issue at all. The result of the action is somewhat random, and the utility is assigned not to the actions, but to the individual outcomes. In other words, every time you take a particular action some utility is observed, but this is only a sample from a random variable, whose distribution is not known.
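To make the "sample from a random variable" point concrete, here is a minimal Python sketch. The utility numbers are made up for illustration (nothing like them appears in the question), and the helper name `estimate` is hypothetical. It only shows that an action tried a few times has a far less certain estimate than one tried many times.

```python
import math
import statistics

def estimate(samples):
    """Return (sample mean, standard error of the mean) for observed utilities."""
    mean = statistics.fmean(samples)
    # The standard error shrinks as 1/sqrt(n): more tries give a tighter estimate.
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr

# Hypothetical utilities observed for two of the five actions in state 27.
observed = {
    "a1": [4.0, 6.0, 5.0, 5.0, 4.0, 6.0, 5.0, 5.0],  # tried 8 times
    "a2": [2.0, 7.0],                                # tried only twice
}

for action, samples in observed.items():
    mean, stderr = estimate(samples)
    print(f"{action}: n={len(samples)} mean={mean:.2f} stderr={stderr:.2f}")
```

With these numbers a1 has the higher average, but a2's estimate is so uncertain that its true expected utility could easily be higher.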

You want to pick actions with high expected utility (exploitation), but you only have a noisy estimate of this expected utility, so over time you want to try every action enough times that you discover which action that is (exploration).
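One common way to balance the two is the UCB1 rule from the bandit literature. The sketch below is my own illustration, not the answer the test paper expects: the counts and means are invented, `ucb1_choice` is a hypothetical helper, and the constant 2 in the bonus comes from the standard UCB1 analysis for rewards scaled to [0, 1], so it would need tuning for other utility scales.

```python
import math

def ucb1_choice(counts, means):
    """Pick the action with the highest UCB1 score.

    counts[a] -- how many times action a was tried in this state
    means[a]  -- average utility observed for action a so far
    """
    total = sum(counts.values())

    def score(action):
        n = counts[action]
        if n == 0:
            return math.inf  # an untried action is always explored first
        # The bonus is large for rarely tried actions and shrinks with more tries.
        bonus = math.sqrt(2.0 * math.log(total) / n)
        return means[action] + bonus

    return max(counts, key=score)

# Hypothetical statistics for the five actions in state 27.
counts = {"a1": 50, "a2": 2, "a3": 20, "a4": 10, "a5": 5}
means = {"a1": 5.0, "a2": 4.5, "a3": 3.0, "a4": 4.0, "a5": 2.0}

greedy = max(means, key=means.get)
print("pure exploitation picks:", greedy)
print("UCB1 picks:", ucb1_choice(counts, means))
```

Pure exploitation picks a1 because it has the best average so far. UCB1 picks a2 instead: it has been tried only twice, so its estimate gets a large uncertainty bonus. As a2 is tried more often, that bonus shrinks and the choice settles on whichever action really has the highest expected utility.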

Further reading here: https://en.wikipedia.org/wiki/Multi-armed_bandit

 

8 minutes ago, azhar_r said:

Since it was already stated in the question that all five actions have been carried out before and the utility values are known, I thought the player would only exploit a particular action with the highest utility value.

Yeah, obviously I don't know your course material, but I kind of read the question as a player can choose one of five things to do in each state and assumed that the 27th action would naturally be state 27. However, one part of the question bothered me a lot...

"And assume that the player has executed each action a different number of times in state 27,"

...and this wording sounds as if the player can do an infinite number of actions per state, which confuses me a bit because then a state is almost meaningless. And then right after, in the same exact sentence...

"noting how valuable each action is in terms of the utility of the states reached after each action."

...which arguably contradicts what was said prior in the sentence as each single action leads to a supposed new state. I absolutely detest poorly worded questions, especially by academic minds. They should not only know better, but do better. This question is a bit confusing when under scrutiny.

 

12 minutes ago, alvaro said:

Yeah, I agree with you, alvaro, but I think this question was intended to be more simplistic to achieve an undoubtedly concise, correct answer... or how would you have written the answer to the question for 2 marks?

I think the question is perfectly clear, although I don't know the context in which it is being posed. The OP was asked to demonstrate his understanding of the exploration versus exploitation dilemma, and he demonstrated that he only understands the exploitation side.

 

8 minutes ago, alvaro said:

I think the question is perfectly clear, although I don't know the context in which it is being posed. The OP was asked to demonstrate his understanding of the exploration versus exploitation dilemma, and he demonstrated that he only understands the exploitation side.

I don't mean to sound argumentative. I just like how you explained exploration as an unknown in the pursuit of discovering possible exploits (it makes sense) and was curious how your answer might differ from mine. I don't think there is any context to the question though. Just face value.

Also, the fact that you feel there might be additional context required kind of supports my position that the question is not worded very well. Curious, what part of the question do you feel needs context?

I mean that I don't know where this question was found. See Lactose's question. I don't know what a "test paper" is. I don't usually encounter questions worth "2 marks" (whatever that is) in the wild.

 

