If a little data is good, then a lot of data must be great, right? That's like saying that if a light breeze feels cool on a hot summer day, then a gale-force blast will feel ecstatic.
Perhaps a better analogy for big data is a spirited racehorse: with proper training and a talented jockey, a well-bred racehorse can set track records, but without training and a rider, the powerful animal won't even make it into the starting gate.
To keep your organization's big data plans on track, you need to dispel the following 10 common misconceptions.
- Big data is "a lot of data"
At its core, big data describes how structured or unstructured data combine with social media analytics, Internet of Things data, and other external sources to tell a "bigger story." That story may be a macro-level description of an organization's operations, or a big-picture view that can't be captured with traditional analytical methods. From an intelligence-gathering perspective, the sheer size of the data involved is incidental.
- Big data must be very clean
In the world of business analytics, there is no such thing as "too fast." Conversely, in the IT world, there is no such thing as "garbage in, gold out." How clean is your data? One way to find out is to run your analytics application, which can identify weaknesses in the dataset. Once those weaknesses are addressed, run the analysis again to highlight the newly "cleaned" areas.
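A minimal sketch of that analyze-clean-rerun loop, assuming pandas and a toy dataset (the column names and data are hypothetical):

```python
# A rough sketch of "analyze, clean, re-run": profile the dataset's weak
# spots, fix what the profile surfaces, then profile again to confirm.
import pandas as pd

def profile_weaknesses(df: pd.DataFrame) -> pd.DataFrame:
    """Report per-column missing-value counts and rates."""
    return pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": df.isna().mean().round(3),
    })

# Hypothetical raw data with the kinds of flaws a first pass exposes.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "purchase_amt": [19.99, None, 35.00, 12.50],
})

print(profile_weaknesses(raw))       # first run: surface the weaknesses
clean = raw.drop_duplicates("customer_id").dropna()
print(profile_weaknesses(clean))     # second run: confirm the cleanup
```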
- All human analysts will be replaced by machine algorithms
Recommendations from data scientists are not always implemented by front-line business managers. Industry executive Arijit Sengupta pointed out in a TechRepublic article that these proposals are often harder to implement than science projects. However, over-reliance on machine-learning algorithms is just as challenging. Machine algorithms tell you what to do, Sengupta said, but they don't explain why you should do it, which makes it difficult to integrate data analytics with the rest of the company's strategic planning.
Predictive algorithms range from relatively simple linear algorithms to more complex tree-based algorithms, and ultimately to extremely complex neural networks.
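As a rough illustration of that spectrum, the sketch below fits a linear model, a tree-based ensemble, and a small neural network to the same synthetic data with scikit-learn; the data and hyperparameters are invented for the example, not recommendations:

```python
# Three points on the complexity spectrum, fit to the same synthetic data:
# linear -> tree-based -> neural network.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # nonlinear target

models = {
    "linear": LinearRegression(),
    "tree-based": RandomForestRegressor(n_estimators=50, random_state=0),
    "neural net": MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                               random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # The simple linear model cannot capture the sine curve; the more
    # complex models can, at the cost of interpretability.
    print(f"{name:>10}: R^2 = {model.score(X, y):.2f}")
```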
- A data lake is necessary
According to Jim Adler, a data scientist at the Toyota Research Institute, the huge repository that some IT managers envision using to store vast amounts of structured and unstructured data simply doesn't exist. Enterprise organizations don't indiscriminately dump all their data into one shared pool. The data is "carefully curated" and stored in independent departmental databases that encourage "focused expertise," Adler said. That is the only way to achieve the transparency and accountability that compliance and other governance requirements demand.
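A minimal sketch of that curated, departmental approach, using SQLite files as stand-ins for independent departmental stores (the departments and tables are hypothetical, not anything Adler describes):

```python
# Data curated into independent departmental stores (separate SQLite
# files here) rather than dumped into one undifferentiated pool.
import sqlite3

# Hypothetical departmental databases, each owned and curated by one team.
departments = {
    "sales.db": "CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)",
    "hr.db": "CREATE TABLE IF NOT EXISTS employees (id INTEGER, name TEXT)",
}
for db_file, ddl in departments.items():
    with sqlite3.connect(db_file) as conn:
        conn.execute(ddl)

# Queries go through the owning department's store, so access and
# accountability stay departmental instead of pooled.
with sqlite3.connect("sales.db") as conn:
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())
```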
- Algorithms are foolproof prophets
Not long ago, the Google Flu Trends project was widely hyped for its claim that it could predict the location of flu outbreaks more quickly and accurately than the US Centers for Disease Control and other health information services. As Michele Nijhuis wrote in a June 3, 2017 New Yorker article, people believed that searches for flu-related terms would accurately predict the regions where outbreaks were about to occur. In fact, simply plotting local temperatures turned out to be a more accurate predictor.
Google's flu prediction algorithm fell into a common big data trap: it turned up meaningless correlations, such as linking high-school basketball games to flu outbreaks because both occur in winter. When data mining runs over a massive dataset, it is far more likely to find relationships that are statistically significant than relationships that are practically meaningful. One example links Maine's divorce rate to US per-capita margarine consumption: the two figures show a "statistically significant" correlation despite having no practical connection whatsoever.
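The trap is easy to reproduce. In the sketch below, two synthetic series that merely drift in the same direction over time (invented stand-ins, not the real divorce or margarine figures) show a strong, "statistically significant" correlation:

```python
# Two unrelated series that both trend downward over the years will
# correlate strongly even though neither has anything to do with the other.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
years = np.arange(2000, 2010)
divorce_rate = 5.0 - 0.1 * (years - 2000) + rng.normal(0, 0.05, len(years))
margarine_lbs = 8.0 - 0.2 * (years - 2000) + rng.normal(0, 0.1, len(years))

r, p = stats.pearsonr(divorce_rate, margarine_lbs)
print(f"r = {r:.2f}, p = {p:.4f}")  # high r, tiny p: significant, meaningless
```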
- You cannot run big data applications on a virtualized infrastructure
When "big data" first appeared on the scene about a decade ago, it was synonymous with Apache Hadoop. As VMware's Justin Murray wrote in a May 12, 2017 article, the term big data now covers a range of technologies, from NoSQL databases (MongoDB, Apache Cassandra) to Apache Spark.
Critics once questioned Hadoop's performance on virtual machines, but Murray pointed out that Hadoop performs comparably on virtual machines and physical machines, while using cluster resources more efficiently. Murray also debunked the misconception that virtual machines fundamentally require storage area networks (SANs). In fact, vendors often recommend direct-attached storage, which offers better performance at lower cost.
- Machine learning is synonymous with artificial intelligence
The gap between an algorithm that recognizes patterns in massive amounts of data and a method that can draw logical conclusions from those patterns is more like a chasm. Vineet Jain wrote in a May 26, 2017 ITProPortal article that machine learning uses statistical interpretation to generate predictive models. This is the technology behind algorithms that can predict what a person might buy based on their past purchases, or which music they might like based on their listening history.
As clever as these algorithms are, they remain far from the goal of artificial intelligence: replicating human decision-making processes. Statistical prediction lacks human reasoning, judgment, and imagination. In that sense, machine learning might be considered a necessary precursor to true AI. Even the most sophisticated AI systems to date, such as IBM Watson, can't match the insight into big data that human data scientists provide.
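A minimal sketch of the kind of statistical prediction Jain describes, with a hypothetical purchase-history matrix: the model extrapolates patterns from past purchases but has no notion of why a customer buys.

```python
# Predict whether a customer will buy a product from their past purchases.
# Pure pattern-matching on history: no reasoning, judgment, or imagination.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: customers; columns: 1 if they previously bought that product
# (coffee, filters, tea, mugs -- all invented for the example).
past_purchases = np.array([
    [1, 1, 0, 0],   # bought coffee + filters
    [1, 1, 0, 1],
    [0, 0, 1, 0],   # bought tea only
    [0, 0, 1, 1],
])
bought_grinder = np.array([1, 1, 0, 0])  # target: purchased a grinder?

model = LogisticRegression().fit(past_purchases, bought_grinder)
new_customer = np.array([[1, 1, 0, 0]])  # looks like the coffee buyers
print(model.predict_proba(new_customer)[0, 1])  # P(buys grinder)
```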
- Most big data projects achieve at least half of their goals
IT managers know that no data-analytics project succeeds 100% of the time. When those projects involve big data, the success rate plummets, as recent survey results from NewVantage Partners show. Ninety-five percent of business leaders said their companies had undertaken a big data project in the past five years, but only 48.4% of those projects achieved "measurable results."
The NewVantage Partners big data implementation survey shows that fewer than half of big data projects achieve their goals, and that "cultural" change is the hardest to accomplish. Source: Data Informed.
In fact, according to Gartner research released in October 2016, big data projects rarely make it past the experimental stage. Gartner's survey found that only 15% of big data implementations had been deployed to production, roughly in line with the 14% reported in the previous year's survey.
- The growth of big data will reduce the demand for data engineers
If the goal of your company's big data initiative is to minimize the need for data scientists, you may be in for an unpleasant surprise. According to the 2017 Robert Half Technology Salary Guide, average annual salaries for data engineers have jumped to between $130,000 and $196,000, while data scientists currently average between $116,000 and $163,000, and business intelligence analysts between $118,000 and $138,750.
- Employees and front-line managers will embrace big data with open arms
The NewVantage Partners survey found that 85.5% of companies are committed to creating a "data-driven culture." Yet the overall success rate of new data initiatives stands at just 37.1%. The three obstacles these companies cited most often were insufficient organizational alignment (42.6%), lack of middle-management adoption and understanding (41%), and business resistance or lack of understanding (41%).
The future may belong to big data, but reaping its benefits will take plenty of old-fashioned hard work directed at the very human side of the organization.