Monday, November 10, 2014

Maybe you need some advice

     Tomorrow will be a great day for most fans of online shopping: November 11, the so-called "Double Eleven", brings a global free-shipping event. I am sure many people have been waiting for this day for a long time. One of my roommates is one of them. For the past week he has spent several hours a day online, searching for items and reading the comments. He doesn't even know what he wants to buy, but he keeps searching and searching. What really matters to him is the discount and the bonus of free shipping; if he doesn't buy anything, the feeling of having missed out will upset him. "Perhaps it will be useful someday." "Well, what is the 'it' you mean?" "I'm searching for it right now."
     Nowadays, my roommate's behavior no longer seems strange. When we want to make a decision, there is so much information to refer to that it easily confuses us and makes the decision harder. The culprit is information overload, one of the negative effects of the abundance of information in this age. It becomes harder and harder for us to react as fast as information spreads, and the useless information attached to what we really want also reduces our efficiency.
figure 1
     Because of this, when we make decisions we like to ask people who have something in common with us for advice. Their advice, based on their own experience, keeps us from making a bad decision.
figure 2
We can also ask something else for a hand: a computational system called a "recommender system". It analyzes your past behavior and gives you suggestions for the present. The three most popular ways to design a recommender system are collaborative filtering, content-based filtering and hybrid recommender systems. Collaborative filtering methods collect and analyze a large amount of information on users' behaviors, activities or preferences, and predict what a user will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can accurately recommend complex items such as movies without requiring an "understanding" of the item itself.
Another common approach to designing recommender systems is content-based filtering. Content-based filtering methods are based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords describe the items, and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items similar to those the user liked in the past (or is examining in the present). In particular, candidate items are compared with items the user has previously rated, and the best-matching items are recommended.
Recent research has shown that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective in some cases. Hybrid approaches can be implemented in several ways: by making content-based and collaborative predictions separately and then combining them; by adding content-based capabilities to a collaborative approach (or vice versa); or by unifying the approaches into one model.
Several studies empirically compare the performance of hybrid methods with pure collaborative and content-based methods and show that the hybrid methods can provide more accurate recommendations than the pure approaches. With the help of this smart friend, our life can become as easy as it used to be.
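To make the three approaches a little more concrete, here is a minimal sketch of the first hybrid strategy (compute a collaborative score and a content-based score separately, then blend them). Everything in it is an assumption made up for illustration: the users, ratings, keywords, the profile, the use of cosine and Jaccard similarity, and the 0.5 blending weight. Real systems are far more elaborate.

```python
from math import sqrt

# Hypothetical data: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"tripod": 5, "lens": 4, "light_stand": 5},
    "bob": {"tripod": 5, "lens": 4, "novel": 1},
    "carol": {"novel": 5, "cookbook": 4, "lens": 1},
}
# Hypothetical keyword descriptions of the candidate items
item_keywords = {
    "light_stand": {"photography", "lighting", "studio"},
    "cookbook": {"cooking", "recipes"},
}
# Hypothetical profile of the kind of item bob likes
profile = {"photography", "lighting"}


def user_similarity(a, b):
    """Cosine similarity between two users' rating vectors."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collaborative_score(user, item):
    """Similarity-weighted average rating from other users, scaled to 0..1."""
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            sim = user_similarity(ratings[user], their)
            weighted += sim * their[item]
            total += sim
    return weighted / total / 5 if total else 0.0


def content_score(user_profile, keywords):
    """Jaccard overlap between the user profile and the item's keywords."""
    union = user_profile | keywords
    return len(user_profile & keywords) / len(union) if union else 0.0


def hybrid_score(user, item, user_profile, weight=0.5):
    """Blend the two separately computed predictions."""
    content = content_score(user_profile, item_keywords[item])
    collaborative = collaborative_score(user, item)
    return weight * content + (1 - weight) * collaborative


ranked = sorted(item_keywords,
                key=lambda item: hybrid_score("bob", item, profile),
                reverse=True)
print(ranked)  # ['light_stand', 'cookbook']
```

In this toy data bob rates things much like alice does, and his profile overlaps with the light stand's keywords, so both signals agree and the light stand comes first.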
figure 3

     Perhaps my roommate has finally grown bored of this endless searching, so he decides to ask his smart friend, standing right in front of him, for some advice. "What would you prefer 'it' to be?" "As a photographer, a suitable light stand will undoubtedly make you more professional." "Maybe you are right. I'll look for one." Great, I finally no longer need to hold the spotlight for him.

Thursday, October 16, 2014

Social network analysis

     Social network analysis is a research method that studies the relationships among a group of actors. The actors can be people, communities, groups, organizations, countries, etc. The phenomena and data reflected in the pattern of their relationships are the focus of network analysis. From the perspective of social networks, the interaction among people in a social environment can be viewed as a pattern or set of rules based on relationships, and the regularities in these relationships reflect the social structure. Quantitative analysis of this structure is the starting point of social network analysis.
     Social network analysis has several important concepts, including degree, closeness and centrality. With these concepts, we can extract more useful information from a social network.
     Degree is the number of edges connected to a vertex in a graph. The degree of an actor shows the number of ties he has in a social network, that is, the number of people he is directly connected to. From this number, we can estimate how important the actor is among other people.
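As a small illustration of degree, here is a sketch on a made-up friendship network. The actors A to E and their ties are hypothetical, and the network is stored as an adjacency mapping from each actor to the set of actors he is tied to.

```python
# Hypothetical undirected network: actor -> set of directly connected actors
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}


def degree(graph, actor):
    """Number of edges (ties) attached to the actor."""
    return len(graph[actor])


degrees = {actor: degree(graph, actor) for actor in graph}
print(degrees)  # {'A': 2, 'B': 3, 'C': 2, 'D': 2, 'E': 1}
```

Here B has the most ties, so by this measure B is the best-connected actor.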
     Closeness is another important concept: the extent to which an actor is close to all other people in the social network. There may be more than one path between two vertices, and we consider only the shortest one. By adding up the shortest distances from one vertex to all other vertices and taking the reciprocal of the sum, we get the closeness of this vertex.
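The closeness calculation described above can be sketched as follows. The network is a hypothetical example, every tie is assumed to have length 1 (so a breadth-first search finds the shortest distances), and the network is assumed to be connected.

```python
from collections import deque

# Hypothetical undirected network: actor -> set of directly connected actors
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}


def shortest_distances(graph, source):
    """Breadth-first search: shortest number of ties from source to each vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, actor):
    """Reciprocal of the sum of shortest distances to all other vertices."""
    dist = shortest_distances(graph, actor)
    total = sum(d for other, d in dist.items() if other != actor)
    return 1 / total if total else 0.0


print(closeness(graph, "B"))  # 0.2, since the distances are 1 + 1 + 1 + 2 = 5
```

E sits at the edge of this network (distances 1 + 2 + 3 + 3 = 9), so its closeness of 1/9 is lower than B's.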
     Centrality measures the relative importance of an actor in a social network. It can be measured in many ways. One of them, betweenness centrality, is defined by the number of shortest paths in the network that pass through a vertex. Betweenness centrality reflects how important an actor is as a bridge between others. If an actor with high betweenness centrality is removed, many shortest paths change at the same time.
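Here is a minimal sketch of betweenness centrality on a hypothetical network. It uses the common convention that, for each pair of other vertices, a vertex is credited with the fraction of that pair's shortest paths passing through it. When every pair has a single shortest path, as in this example, that is the same as counting the paths. This brute-force version is only meant for small graphs.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected network: actor -> set of directly connected actors
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}


def bfs_counts(graph, source):
    """Distance and number of shortest paths from source to each reachable vertex."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                # node lies on a shortest path to neighbor
                paths[neighbor] += paths[node]
    return dist, paths


def betweenness(graph, v):
    """Sum over pairs (s, t) of the share of shortest s-t paths that pass through v."""
    info = {node: bfs_counts(graph, node) for node in graph}
    dist_v, paths_v = info[v]
    total = 0.0
    for s, t in combinations([node for node in graph if node != v], 2):
        dist_s, paths_s = info[s]
        if v not in dist_s or t not in dist_s:
            continue  # s cannot reach v or t, so no path to count
        if dist_s[v] + dist_v[t] == dist_s[t]:
            total += paths_s[v] * paths_v[t] / paths_s[t]
    return total


print({node: betweenness(graph, node) for node in sorted(graph)})
# {'A': 0.0, 'B': 4.0, 'C': 0.0, 'D': 3.0, 'E': 0.0}
```

Removing B would cut A and C off from D and E, which matches its high score.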

     There is a proverb, “What matters is not what you know, but who you know,” which I think briefly and aptly explains what social network analysis is about.

Tuesday, September 30, 2014

Methods for sentiment classification

     Nowadays, e-shopping is no longer a strange concept to most people. On the Internet, all we need to do is click the mouse to choose the goods we want, which saves a great deal of time and energy. When we make our decisions, other people's comments, apart from our personal preferences, may also influence our choice.
     Not only in e-shopping but in many other areas, the comments attached to an entity play an increasingly important role in our daily life. It is therefore useful to gather statistics by dividing these comments into three classes: positive, neutral and negative.
     There are many ways to do this, and one of them is dictionary-based. In this method, each word is given a score according to its distance from the three polarities, as recorded in a dictionary built in advance. Although it handles a single word easily, it has its own limitations. One is that it cannot handle words that are not in the dictionary, so a great deal of time is required to build and update the dictionary.
     Another method, supervised learning, can work more efficiently if we have some documents whose sentiment class we already know. Using these documents, we can train a classifier to simplify our later work. When a new word arrives, the classifier gives it a score and places it in the corresponding class according to the thresholds. In this way, new words can be classified and the classifier is updated at the same time. The problem with this method is its low tolerance for error. When the meaning of a word is fuzzy, the classifier is much more likely to give it a score near a threshold, which can lead to a wrong classification. This introduces error into the system, and the error accumulates with each update, so the classifier works worse and worse as time goes by.
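A minimal sketch of the dictionary-based idea follows. The tiny lexicon, its scores between -1 and 1, the averaging of word scores and the 0.25 threshold are all assumptions chosen for illustration, not a real sentiment dictionary.

```python
# Hypothetical dictionary: word -> polarity score in [-1, 1]
lexicon = {
    "great": 1.0,
    "good": 0.5,
    "fine": 0.0,
    "poor": -0.5,
    "terrible": -1.0,
}


def classify_comment(comment, lexicon, threshold=0.25):
    """Average the scores of known words and map the result to a class."""
    words = [word.strip(".,!?").lower() for word in comment.split()]
    scores = [lexicon[word] for word in words if word in lexicon]
    if not scores:
        return "neutral"  # no word is in the dictionary
    average = sum(scores) / len(scores)
    if average > threshold:
        return "positive"
    if average < -threshold:
        return "negative"
    return "neutral"


print(classify_comment("Great camera, good price!", lexicon))  # positive
print(classify_comment("Terrible battery.", lexicon))          # negative
print(classify_comment("Awesome tripod", lexicon))             # neutral
```

The last call shows the limitation mentioned above: "awesome" is not in the dictionary, so the comment falls back to neutral until someone adds the word.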
figure 1: classifying a new word
figure 2: an error happens
figure 3: error accumulation
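The supervised approach can be sketched in the same spirit. This is a deliberately simple stand-in for a real classifier: each word's score is the average label value of the labeled comments it appears in, and a new comment is classified by averaging its word scores against a threshold. The training comments, the label values and the 0.3 threshold are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical numeric value for each sentiment class
LABEL_VALUE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

# Hypothetical labeled comments whose sentiment class is already known
labeled = [
    ("great lens and fast delivery", "positive"),
    ("great price", "positive"),
    ("terrible packaging and slow delivery", "negative"),
    ("terrible quality", "negative"),
]


def train(labeled_comments):
    """Learn a score per word: the mean label value of the comments containing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, label in labeled_comments:
        for word in text.lower().split():
            totals[word] += LABEL_VALUE[label]
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in counts}


def classify(text, word_scores, threshold=0.3):
    """Average the learned scores of known words and apply the thresholds."""
    scores = [word_scores[w] for w in text.lower().split() if w in word_scores]
    average = sum(scores) / len(scores) if scores else 0.0
    if average > threshold:
        return "positive"
    if average < -threshold:
        return "negative"
    return "neutral"


model = train(labeled)
print(classify("fast delivery", model))  # positive
print(classify("slow delivery", model))  # negative
print(classify("great quality", model))  # neutral
```

The sketch stops before the update step. Feeding the classifier's own outputs back into `train` as if they were true labels is exactly where a borderline mistake would be learned and then repeated.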

     Therefore, we should choose the method that suits the situation in order to keep both workload and efficiency under control.

Monday, September 29, 2014

Some thoughts about Natural Language Processing

     With the wide use of electronic products in our daily life, improving the quality of communication between people and these products becomes more and more important. In this situation, the concept of Natural Language Processing (NLP) has come into view. NLP has two parts. One is natural language generation, which aims to convert computer data into natural language. The other is natural language understanding, which transforms natural language into a form that is easier for a computer to process.
     NLP performs well when we communicate with the computer using a limited vocabulary. However, when the system is placed in an environment with more uncertainty and ambiguity, the results disappoint us. The main reasons for the decline include the difficulty of defining the boundaries between words, polysemy, syntactic ambiguity and non-standard input. Here is an example. When we say "hehe", it may be a polite refusal, or it may show disdain for what we heard. In addition, the implied meaning behind a sentence also causes confusion. If I say "Would you please bring me the salt?", what I really mean is that I hope you will bring me the salt, not that you simply answer "Yes".
     Given the problems in NLP today, we need some targeted improvements. First, process more real text instead of relying on traditional grammar-based analysis. Second, update the glossary in time. Finally, focus on both shallow and deep levels of understanding during analysis. If we can find suitable ways to implement these measures, NLP will probably perform much better in the near future.