Analyzing unstructured data – the next new trend in predictive analysis that can even predict the stock market?
A widespread view is that structured and semi-structured data constitute only about 10% of the data aggregated by companies. The remaining 90% of all available data is unstructured. Structured data breaks down well into fields and can be stored, for example, in a data warehouse. Unstructured data does not conform to the formal structure of data models, so it has to be “structured” first, for example by transforming it into numerical inputs that, in turn, can be used in predictive models or traditional analytic algorithms. In practice, the data must be classified according to predefined criteria before it can be integrated into an existing system.
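One simple way to picture "transforming text into numerical inputs" is a bag-of-words count. The sketch below is only an illustration under our own assumptions: the function names `tokenize` and `bag_of_words` and the two sample sentences are invented for this example, and real systems typically use far richer representations.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Turn raw texts into count vectors over a shared, sorted vocabulary."""
    token_lists = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # One number per vocabulary word: how often it occurs in this text.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample texts, invented for illustration.
docs = ["The delivery was late", "Late again, the support was great"]
vocabulary, vectors = bag_of_words(docs)

for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each text becomes a row of numbers with one column per word, which is the kind of input a traditional predictive model can consume. Word order and context are lost, which is one reason this approach is only a starting point.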
Three categories of data: examples
- Unstructured data: documents, full-text articles, emails, social media posts, images, videos, or recorded call center conversations
- Structured data: usually numerical input such as number of website visits or click-through rates
- Semi-structured data: XML or web server logs; such data can also consist of data fields that themselves contain unstructured data
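To make the semi-structured category more concrete, here is a small sketch that splits one web server log line into named fields. It assumes the widely used Common Log Format; the sample line, the pattern name `LOG_PATTERN` and the function `parse_log_line` are our own illustration, not taken from any particular system.

```python
import re

# Pattern for one line in Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the fields of a log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical log line, invented for illustration.
sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /pricing?ref=newsletter HTTP/1.1" 200 2326'
)
record = parse_log_line(sample)
print(record)
```

Fields such as the status code are clean and countable, while the requested path is still free-form text that may need further interpretation.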
Email is a good example of the difference between structured and unstructured data: whereas emails can be classified, and retrieved, by date, time, sender and recipient, the body text of an email is unstructured data that needs to be classified first. For example, it can be indexed by topic or intent. Another useful classification system is the five W’s: who, what, where, when and why. Obviously, these are all things the human brain processes as a matter of course while reading an email, but they can be very difficult for a machine to extract.
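The split between the structured and the unstructured part of an email can be sketched with Python's standard `email` package. The message below is invented for illustration, and the dictionary keys (`sender`, `recipient`, `sent_at`, `subject`) are simply names we chose.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message, invented for illustration.
raw_email = """\
From: alice@example.com
To: support@example.com
Date: Mon, 06 Mar 2023 09:15:00 +0000
Subject: Order arrived damaged

Hi, my order arrived yesterday and the box was crushed.
Can you send a replacement before Friday?
"""

message = message_from_string(raw_email)

# Structured part: header fields that can be stored and queried directly.
structured = {
    "sender": message["From"],
    "recipient": message["To"],
    "sent_at": parsedate_to_datetime(message["Date"]),
    "subject": message["Subject"],
}

# Unstructured part: free text that still has to be classified,
# for example by topic or intent.
body = message.get_payload()

print(structured)
print(body)
```

The header fields drop straight into database columns. The body, by contrast, is a plain string: deciding that it is a complaint with a deadline attached is exactly the step that requires text analysis.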
Sifting through masses of unstructured data is costly, and many businesses shy away from these costs, thus missing out on big opportunities to glean information from the biggest of the big data sets. In response, the past few years have seen a rise in natural language processing algorithms that promise to help businesses “understand” unstructured data. This is called text analysis, and it is considered the next big thing in big data analytics.
Common applications include voice of the customer (VoC) solutions and sentiment analysis on social media. Sentiment analysis offers to “structure” conversations or opinions on social media so that they can be integrated into existing business intelligence systems for further analysis as structured data. This allows businesses to “listen” to these conversations and generate a real-time snapshot of how the public feels about any relevant topic. For this reason, sentiment analysis is also known as opinion mining or social listening, and has even been called “emotional AI”.
Sentiment analysis is comparatively easy for short informal texts such as tweets (for a summary of how machines interpret meaning, see our blogpost on natural language processing). However, analyzing sentiment at a higher level, e.g. document level, would presuppose that the document is about only one subject and expresses only one unambiguous sentiment about it. Obviously, this is almost never the case for longer texts. The difficulty lies in deciding where to break the document down. Even a single paragraph can express opinions on different topics, or conflicting sentiments that also differ in intensity. In other words, long texts can be too ambivalent for a machine to analyze.
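The problem with document-level sentiment can be shown with a deliberately naive word-list scorer. Everything in this sketch is an assumption made for illustration: the tiny `LEXICON`, the sample review and the function names are invented, and production tools use much more sophisticated models.

```python
import re

# Hypothetical mini-lexicon, invented for illustration only.
LEXICON = {
    "great": 1, "love": 1, "fast": 1,
    "slow": -1, "terrible": -1, "broken": -1,
}


def score(text):
    """Sum the lexicon values of all words in the text (unknown words count 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)


def sentence_scores(document):
    """Split a document at sentence-ending punctuation and score each sentence."""
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", document)
        if s.strip()
    ]
    return [(s, score(s)) for s in sentences]


# Hypothetical review, invented for illustration.
review = (
    "The camera is great and I love the screen. "
    "The battery is terrible. Shipping was slow!"
)

document_score = score(review)
per_sentence = sentence_scores(review)

print("document:", document_score)
for sentence, value in per_sentence:
    print(value, "<-", sentence)
```

Scored as a whole, the review comes out neutral, although it contains one clearly positive and two clearly negative statements about three different topics. Splitting by sentence recovers some of that detail, but it still says nothing about which topic each sentiment refers to or how strongly it is felt.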
In finance, sentiment analysis offers the possibility of gauging investor sentiment in real time or even predicting markets in ways that were not possible before. However, it is important to realize that the concept of “sentiment” is still rather basic and mostly amounts to a simple for-or-against judgment. On top of that, sentiment can be heavily biased, and it is questionable whether a machine can filter out faulty perceptions.
Nevertheless, a study from 2010 claimed Twitter “can predict the stock market”. Indeed, Twitter is an excellent source for listening to what people are talking about. Firstly, Twitter verifies accounts as belonging to real people, among them many celebrities who are considered influencers and are thus interesting to mine because of their multiplier effect. Secondly, the 140-character format is not too complex to analyze: references to topics or subjects are rather unambiguous, which makes sentiment fairly easy to determine. For this reason, microblogging analysis has become a regular feature of many opinion-mining solutions. However, Twitter users represent only a small, “skewed” part of the general population, so the predictive power of their posts may be limited after all. In fact, microblogging as a source of data is considered adequate only for a very high-level analysis.
A further promising source of data is news text analysis. Although the results were not yet satisfactory, a recent study describes how sentiment analysis could be used to predict stock price changes. In order to create a predictive model, historical data of stock indices was merged with data extracted from the headlines of articles in the New York Times archive. Articles that contained information on companies were mined for sentiment, and the results were fed into a machine-learning model, i.e. a system that discovers patterns and interprets data on its own once it has been fed a large amount of data. However, the question remains whether predictions and analyses based on past data are helpful for generating insights into the future. There seem to be more variables that need to be taken into consideration.
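The merging step described above can be sketched as follows. This is not the study's actual pipeline, whose features and model are not described here; the dates, scores and index changes are made up for illustration, and the function names are our own.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows, for illustration only: (date, sentiment score of one headline).
headline_scores = [
    ("2020-01-02", 1), ("2020-01-02", -1), ("2020-01-02", 1),
    ("2020-01-03", -1), ("2020-01-03", -1),
    ("2020-01-06", 1),
]

# Made-up daily index changes in percent, for illustration only.
index_change = {"2020-01-02": 0.4, "2020-01-03": -0.7, "2020-01-06": 0.2}


def daily_sentiment(rows):
    """Average the headline scores of each day."""
    by_day = defaultdict(list)
    for day, value in rows:
        by_day[day].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def build_training_rows(sentiment_by_day, change_by_day):
    """Join sentiment and index data on the date.

    Each row is (date, average sentiment, index went up?), i.e. one
    feature and one label that a machine-learning model could be trained on.
    """
    rows = []
    for day in sorted(sentiment_by_day):
        if day in change_by_day:
            rows.append((day, sentiment_by_day[day], change_by_day[day] > 0))
    return rows


training_rows = build_training_rows(daily_sentiment(headline_scores), index_change)
for row in training_rows:
    print(row)
```

A table like this only pairs past sentiment with past price movements. Whether a model trained on it says anything reliable about the future is precisely the open question.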
Advanced analytics such as machine learning methods are showing promising results in determining what textual data is about. At the same time, unstructured data sets are becoming more complex and now include full-text documents or longer articles, blogposts and other macro-content, videos, speech transcripts and images. It may take time before a machine is able to gather, interpret and convert unstructured data into intelligence that can be trusted.