Tuesday, September 30, 2014

Methods for sentiment classification

    Nowadays, online shopping is familiar to most people. On the Internet, a few mouse clicks are enough to choose the product we want, which saves a great deal of time and energy. When we decide what to buy, other people's comments may influence our choice in addition to our own preferences.
    Not only in online shopping but in many other areas as well, the comments attached to an entity play an increasingly important role in our daily life. It is therefore useful to sort these comments into three classes (positive, neutral and negative) and compile statistics on them.
    There are many ways to do this, and one of them is dictionary-based. In this method, each word is given a score according to its distance from the three polarities, as recorded in a dictionary built in advance. Although this makes it easy to handle a single word, the method has its limitations. One is that it cannot handle words that are not in the dictionary, so a great deal of time is needed to build and update the dictionary.
    Another method, supervised learning, can work more efficiently if we have some documents whose sentiment class we already know. Using these documents, we can train a classifier to simplify our later work. When a new word arrives, the classifier gives it a score and places it in the corresponding class according to the thresholds. In this way, new words can be classified and the classifier is updated at the same time. The weakness of this method is its low tolerance for errors. When the meaning of a word is fuzzy, the classifier is more likely to give it a score close to a threshold, which can easily lead to a wrong classification. This introduces an error into the system, and errors accumulate with each update, so the classifier performs worse and worse as time goes by.
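    As a rough illustration of the dictionary-based idea, here is a minimal Python sketch. The lexicon entries, the single score in [-1, 1] (a simplification of the "distance from three polarities") and the 0.25 threshold are all my own assumptions, not values from any real sentiment dictionary. It also reports the words it cannot score, which is exactly the limitation mentioned above.

```python
# Hypothetical lexicon: word -> polarity score in [-1, 1].
# The entries and values are illustrative assumptions only.
LEXICON = {
    "good": 1.0,
    "great": 1.0,
    "fine": 0.5,
    "okay": 0.0,
    "slow": -0.5,
    "bad": -1.0,
    "terrible": -1.0,
}


def score_comment(comment, lexicon=LEXICON):
    """Return (average score of the known words, list of unknown words)."""
    words = comment.lower().split()
    known = [lexicon[w] for w in words if w in lexicon]
    unknown = [w for w in words if w not in lexicon]
    average = sum(known) / len(known) if known else 0.0
    return average, unknown


def classify(score, threshold=0.25):
    """Map a score to one of the three classes (threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


average, unknown = score_comment("terrible battery")
label = classify(average)  # "battery" ends up in `unknown`: the dictionary cannot score it
```

    A comment made up only of unknown words falls back to a score of 0.0 and is treated as neutral, which is one reason the dictionary has to be kept up to date.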
Figure 1: Classifying a new word
Figure 2: An error happens
Figure 3: Error accumulation
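    To make the error accumulation more concrete, here is a small Python sketch of a classifier that scores text, applies thresholds, and then updates itself with its own prediction. Everything in it is an assumption for illustration: the class name `ThresholdClassifier`, the count-based word score, the 0.3 threshold and the toy training sentences. It is not a real supervised learning library, only a toy that behaves in the way described above.

```python
from collections import Counter


class ThresholdClassifier:
    """Toy word-count classifier (hypothetical, for illustration only)."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed decision threshold
        self.pos = Counter()        # word counts seen in positive documents
        self.neg = Counter()        # word counts seen in negative documents

    def train(self, text, label):
        words = text.lower().split()
        if label == "positive":
            self.pos.update(words)
        elif label == "negative":
            self.neg.update(words)
        # Neutral documents leave the counts unchanged in this sketch.

    def word_score(self, word):
        p, n = self.pos[word], self.neg[word]
        return (p - n) / (p + n) if p + n else 0.0

    def score(self, text):
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(self.word_score(w) for w in words) / len(words)

    def classify(self, text):
        s = self.score(text)
        if s > self.threshold:
            return "positive"
        if s < -self.threshold:
            return "negative"
        return "neutral"

    def classify_and_update(self, text):
        """Classify, then retrain on the predicted label (errors can feed back)."""
        label = self.classify(text)
        self.train(text, label)
        return label


clf = ThresholdClassifier()
clf.train("good quality", "positive")
clf.train("bad quality", "negative")

first = clf.classify_and_update("good phone")    # "phone" is absorbed as a positive word
second = clf.classify_and_update("phone broke")  # misjudged, so "broke" is learned as positive too
```

    In this toy run the second comment is labeled positive only because "phone" was absorbed as a positive word one step earlier, and after the update "broke" carries a positive score as well. One doubtful decision has turned into a lasting error in the classifier.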

    Therefore, we should choose the method that suits the situation at hand, so as to keep both workload and efficiency under control.

Monday, September 29, 2014

Some thoughts about Natural Language Processing

    With the wide use of electronic products in our daily life, improving the quality of communication between people and these products is becoming more and more important. Against this background, the concept of Natural Language Processing (NLP) has come into view. NLP has two parts. One is the natural language generation system, which aims to convert computer data into natural language. The other is the natural language understanding system, which transforms natural language into a form that is easier for a computer to process.
    NLP performs well when we communicate with the computer using a limited vocabulary. However, when the system is placed in an environment with more uncertainty and ambiguity, the results are disappointing. The main causes of this decline are the difficulty of defining the boundaries between words, polysemy, ambiguous syntax and non-standard input. Here is an example. When we say "hehe", it may be a polite refusal, or it may show disdain for what we have just heard. In addition, the implied meaning behind a sentence also causes confusion. If I say "Would you please bring me the salt?", what I really mean is that I hope you will bring me the salt, not simply answer "Yes".
    Given the problems NLP faces today, we need some targeted improvements. First, do more processing of real text instead of relying on traditional grammar-based analysis. Second, keep the vocabulary up to date. Finally, pay attention to both shallow and deep levels of understanding during analysis. If we can find a suitable way to implement these measures, NLP will probably perform much better in the near future.
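    The word boundary problem can be shown with a few lines of Python. This is only a sketch under my own assumptions: the function name `segmentations` and the tiny vocabulary are made up for the example, and real segmenters are far more sophisticated. It simply lists every way an unspaced string can be split into known words.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into words found in `vocabulary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            for tail in segmentations(text[end:], vocabulary):
                results.append([head] + tail)
    return results


# Hypothetical vocabulary, chosen only to make the ambiguity visible.
VOCAB = {"no", "now", "here", "where", "nowhere"}

options = segmentations("nowhere", VOCAB)
# options == [["no", "where"], ["now", "here"], ["nowhere"]]
```

    Even this seven-letter string has three valid readings, and nothing in the string itself tells the computer which one the speaker meant.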