The first aim of this paper is to identify which Big Data business model (for a science-based activity) can provide IT services to biotechnology and life sciences companies, as well as to research laboratories. The second aim is to define a methodology for marketing a service that is still widely unknown to these companies and laboratories.
Since 2011, Big Data has been identified as an emerging market, given the availability of huge amounts of commercial and marketing data. Life sciences are also known to generate a deluge of data, a largely untapped source of information. Our main approach was to identify what is specific to the Life Sciences and what managers of Life Sciences companies and laboratories may expect from a Big Data activity-based company. Life Sciences are used to dealing with large amounts of data, most of them in well-known structured formats. The question was what additional actionable information Big Data technologies and analysis could provide. We also evaluated the needs and expectations of biotechnology and life sciences companies and research laboratories regarding data search and analysis, through an online survey addressed to life sciences company and laboratory contacts. Most respondents require anonymized and secure data analysis, and expect actionable information to launch a new biotech product or to confirm a strategy.
Big Data is a concept that has been generating buzz since 2011, although its origin remains uncertain. Diebold (2012) traced the term to a lunch discussion at Silicon Graphics Inc. (SGI) in the mid-1990s. John Mashey, the renowned chief computer scientist of SGI, is thought to have been the first to spread the term Big Data, during a conference in 1998. However, the present hype around Big Data can be partly attributed to IBM, which popularized the concept and invested in this new analytics market.
Intuitively, size is the first characteristic that comes to mind to define Big Data. However, new features have recently emerged to define the concept more precisely. Laney (2001) proposed a common three-dimensional framework for Big Data: Volume, Variety and Velocity, known as the Three V's. Gartner, Inc. in its IT Glossary and the TechAmerica Foundation use similar definitions (Gandomi and Haider, 2015).
- Volume refers to the scale of data, which varies by industry. Big Data sizes are reported in terabytes (10^12 bytes, TB), petabytes (10^15 bytes, PB) and even zettabytes (10^21 bytes, ZB). IDC projects that the digital universe will reach 40 zettabytes by 2020 (EMC study, 2014). For example, Facebook processes more than 500 TB daily, including 300 million uploaded photos and 2.7 billion "likes" (Jay Parikh, VP Facebook infrastructure engineering, 2012).
- Variety refers to the structural heterogeneity of the data. Structured data are typically tabular data found in spreadsheets and relational databases. Unstructured data are text, images, audio and video. Semi-structured data can be illustrated by XML (eXtensible Markup Language) documents, used to exchange documents on the web, in the publishing industry, and even within Microsoft Word documents.
- Velocity refers to the rate at which data are generated and the speed at which they should be analyzed and acted upon. For example, Wal-Mart processes more than one million transactions per hour (Cukier, 2010), and at Facebook 70,000 queries are executed and 105 TB of data are scanned via Hive every day. Moreover, digital devices such as smartphones and sensors generate high-frequency data (geolocation, demographics, buying patterns, physiological data).
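The variety dimension can be made concrete with a short sketch: the same logical record may arrive as a CSV row (structured) or as an XML fragment (semi-structured), each needing its own parsing path before analysis. The field names below are made up for illustration; only the Python standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured source: a tabular CSV row (field names are made up)
csv_text = "id,name,likes\n42,alice,7\n"
row = next(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured source: the same record as an XML fragment
xml_text = "<user id='42'><name>alice</name><likes>7</likes></user>"
node = ET.fromstring(xml_text)

# Each format needs its own access path, but both yield the same logical record
record_from_csv = {"id": row["id"], "name": row["name"], "likes": int(row["likes"])}
record_from_xml = {"id": node.get("id"), "name": node.findtext("name"),
                   "likes": int(node.findtext("likes"))}
assert record_from_csv == record_from_xml
```

Integrating heterogeneous sources into one common record shape, as above, is precisely the "connect, match, clean and transform" work that the variety dimension implies.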
Major IT companies have proposed three more dimensions:
- Veracity (the fourth V, pushed by IBM) refers to the unreliability inherent in some sources of data. For instance, customer opinions or feelings expressed in social media are by nature uncertain.
- Variability and complexity are dimensions suggested by SAS Inc. Variability refers to variation in data flow rates: periodic peaks and troughs, due to server access or to the number of requests to a source, can alter Big Data velocity. The huge number of sources creates the complexity of Big Data, which needs to be connected, matched, cleaned and transformed.
- Value is a dimension put forward by Oracle to distinguish low-value data from high-value data (i.e. analyzed data). In a Datameer Inc. white paper (2013), Groschupf et al. reported that the main objectives of companies implementing Big Data are to increase revenue, decrease costs and increase productivity, which is consistent with the goal of creating value out of Big Data. The key difference between Big Data and classical reporting and monitoring models is the predictive added value of Big Data (Brasseur, 2013).
Big Data in itself is worthless: data analysis is required to extract intelligence from the data and support decision-making. The Big Data process, or pipeline, of extracting insights can be divided into two sub-processes: data management and data analytics (Gandomi and Haider, 2015). This process can be defined as a Business Intelligence (BI) system according to Wang and Liu (2009). A Business Intelligence system should have the following basic features (Marín-Ortega et al., 2014):
- Data Management: including data extraction, data cleaning, data integration, efficient storage and maintenance of large amounts of data
- Data Analysis: including information queries, report generation, and data representation functions
- Knowledge discovery: extracting useful information (knowledge or insights) from rapidly growing volumes of digital data in databases.
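As a minimal illustration of these three features in sequence, the sketch below (in Python, with made-up record fields and an arbitrary threshold) cleans and integrates raw records, answers a simple analytical query, and applies a naive rule to surface an insight; a real BI system would of course operate at a very different scale.

```python
# Hypothetical raw records; field names are made up for illustration
raw = [
    {"sample": "s1", "assay": "elisa", "value": "12.5"},
    {"sample": "s2", "assay": "ELISA", "value": "bad"},   # dirty record
    {"sample": "s3", "assay": "elisa", "value": "8.1"},
]

# Data management: cleaning (drop unparsable values) and integration (normalize labels)
def clean(records):
    out = []
    for r in records:
        try:
            out.append({"sample": r["sample"],
                        "assay": r["assay"].lower(),
                        "value": float(r["value"])})
        except ValueError:
            continue  # discard records that fail validation
    return out

# Data analysis: a simple query/report over the cleaned data
def mean_value(records, assay):
    vals = [r["value"] for r in records if r["assay"] == assay]
    return sum(vals) / len(vals)

# Knowledge discovery: flag samples well above the assay mean (arbitrary factor)
def outliers(records, assay, factor=1.2):
    m = mean_value(records, assay)
    return [r["sample"] for r in records
            if r["assay"] == assay and r["value"] > factor * m]

cleaned = clean(raw)
print(mean_value(cleaned, "elisa"))   # 10.3
print(outliers(cleaned, "elisa"))     # ['s1']
```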
Most authors represent the Big Data analysis pipeline from a computer-driven process perspective. We suggest highlighting the role of data visualization both as a result of data management and as a tool for data analysis (Fig. 1). The objective of data visualization is to present information clearly and efficiently to viewers using well-designed graphics. Data visualization is a recent field, presented as one of the steps of data science (see below) (Friedman, 2008) and as a tool to communicate information from complex data sets. Fernanda B. Viégas, a Brazilian computer scientist trained at the MIT Media Lab, founded with Martin Wattenberg Google's "Big Picture", a data visualization research group. They suggested that an "ideal visualization should not only communicate clearly, but stimulate viewer engagement and attention" (Viégas and Wattenberg, 2011).
The role of human collaboration in refining Big Data processes is also essential, because each step requires a human decision (Jagadish et al., 2014). One challenge of Big Data is to structure the data so that they can be reused; human intervention is needed to structure them. Data analytics techniques support the processes handled by Data Scientists (Table 2).
3.0 What about Big Data in science?
There may be some confusion between the terms Big Data Science and Big Data in science, specifically in Biology or in the Life Sciences. Big Data starts from data characteristics, whereas Big Data Science starts from data use (Jagadish, 2015). The National Consortium for Data Science (NCDS), an industry and academic partnership based in Chapel Hill, defined data science in 2013 as "the systematic study of digital data using scientific techniques of observation, theory and development, systematic analysis, hypothesis testing and rigorous validation". Data Scientists are people specialized in data analytics. For Big Data in Biology or Health, the Data Scientists are bioinformatics scientists.
Big Data in Biology, in the Life Sciences or in Health refers to Big Data computing in a specific field or industry, namely the Life Sciences (if we consider health a life science). In the past, biologists used the term bioinformatics to describe the methods and techniques for managing the data generated by research in biology and medicine. The term bioinformatics, although mainly dedicated to genomic data, has fallen into disuse since Big Data emerged as a buzzword in 2011 (Fig. 2). This is why, in this paper, we propose to use the term Big Data in Biology or in the Life Sciences to discuss specific applications and uses of Big Data in this field.
The data generated by the Life Sciences differ from the data available in other fields or industries. Most biological data are generated by academics under the English neologism of omics, which refers to fields of biology whose names end in -omics, such as genomics, proteomics or metabolomics. An overview of the biological data produced and their availability is presented in Table 2. Data formats in the Life Sciences (e.g. the PDB format for proteins, DICOM for medical imaging) are distinctive and require specific software tools (e.g. the FASTA or BLAST software packages to compare nucleic acid or protein sequences). Big Data management in the Life Sciences will have to combine classical data management (Table 1) and the data analysis techniques listed in Table 2 with this specific data format management. For example, biology scientists use link prediction techniques to reveal links or associations in biological networks (e.g. cellular signal transduction pathways) in order to reduce the cost of additional expensive experiments (Navlakha et al., 2012).
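As an illustration of such format-specific tooling, the FASTA plain-text sequence format read by sequence-comparison packages is simple enough to parse in a few lines of standard Python. The sketch below uses made-up sequences and is not a substitute for a production parser such as those in Biopython:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence} (minimal sketch)."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):        # a '>' line starts a new record
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)          # sequence lines may wrap; concatenate them
    if header is not None:
        records[header] = "".join(chunks)
    return records

# Made-up example records
fasta = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical nucleotide
ATGCGTAA
"""
recs = parse_fasta(fasta)
print(recs["seq1 hypothetical protein"])  # MKTAYIAKQRQISFVKSHFS
```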
Interestingly, a search on Google Trends shows that the terms "Big Data in Life Sciences" and "Big Data in Biology" generate insufficient data volume to display a result. When typing "Big Data in Life Sciences" into Google Search, most of the links listed relate to Big Data in Health. IBM, a Big Data pioneer, has for instance opened a very well documented Big Data & Analytics Hub. The reason is that the Big Data revolution is expected to accelerate value creation and innovation and therefore to reduce costs (Groves et al., 2013). This McKinsey & Company white paper estimated that US healthcare costs could be reduced by $300 billion to $450 billion.
4.0 Specific issues of Big Data
4.1 Data governance
In a 2014 survey, PwC, a consultancy, found that 62% of Life Sciences executives had changed the way they approach big decision-making as a result of big data or analytics. According to the same PwC survey, 81% of companies had not defined any strategic data governance, and 54% of respondents thought that top management was not concerned with data quality.
In the Life Sciences, so far, most data have been structured and available in dedicated databanks such as GenBank (genomics) or ExPASy (proteomics), where data are recorded in specific, operable formats. Nevertheless, a large amount of data remains unstructured or unindexed, and is therefore unavailable for direct use. In many countries, there are initiatives to index Life Sciences data and to think about the data life cycle (e.g. RepliBio, a collaborative Brittany project framed by the Biogenouest platforms).
4.2 Personal data protection
The main interest of Big Data is to collect any kind of data, most of it unstructured, from the deep and hidden Internet. Among these collected data is a lot of personal data (age, sex, hobbies, jobs, ...), which the big Internet players (e.g. Google, Facebook) try to capture, use (for marketing) and monetize. The confidentiality policies and general conditions of use published by these major players are accepted by a large majority of social network users, who turn a blind eye to the fact that their personal data can be used, or sold to any interested operator. In Europe, users must be informed about the potential use of their data, and the "right to be forgotten" on the Internet was finally, reluctantly, accepted by Google Inc. in 2014 under pressure from the French legal system (court order, Dec. 19th, 2014) and the European Court of Justice (judgment of May 13th, 2014). In 2012, the European Commission committed to a major reform of the EU legal framework on personal data protection. In the US, there is no single, comprehensive federal law regulating the collection and use of personal data.
Personal health data are a major concern. Medical and patient information is protected in Europe, even though such data are useful for understanding diseases and physiological processes. For instance, in Framework Programme projects financed by the European Commission (e.g. FP7 or H2020), patient data are anonymized before use in research work packages.
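A common minimal building block for such anonymization is pseudonymization: direct identifiers are replaced by a keyed hash, so records stay linkable across datasets without exposing the patient's identity. The sketch below is illustrative only (the field names and the secret key are made up, and a real project would follow its approved data-protection protocol):

```python
import hashlib
import hmac

# Secret key held by the data controller, never shipped with the data (made-up value)
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(patient_id: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256), truncated."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "FR-12345", "age": 54, "diagnosis": "T2D"}

# Pseudonymized copy: the identifier is replaced, the other fields are kept
anonymized = dict(record, patient_id=pseudonymize(record["patient_id"]))

assert anonymized["patient_id"] != record["patient_id"]
# The same input always maps to the same pseudonym, so records remain linkable
assert pseudonymize("FR-12345") == anonymized["patient_id"]
```

Note that keyed-hash pseudonymization alone does not make data anonymous in the legal sense; quasi-identifiers (age, location, rare diagnoses) can still allow re-identification.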
4.3 Data ownership
In most countries, data ownership has no specific legal status. The data producer is not, strictly speaking, the owner of the data. A datum is a piece of information, and information is free (Frochot, 2011). Only the intellectual creation based on the data (i.e. copyright) or the data collection method (structured databases) can be protected.
5.0 How can we market Big Data services?
Since Big Data as a service has only been known since 2011, we can intuitively assume a lack of marketing, image and knowledge around it.
In order to quantify the needs of managers in Life Sciences companies regarding IT and Big Data services, we launched a survey among 2,565 contacts provided by CBB CapBiotek, the Brittany biotechnology organization. Data analysis (75%) and data visualization (50%) are considered the most important services, chosen in order to launch new products (50%) and proofs of concept (75%). Even though most respondents agree on secure data procedures and treatment (78%), most have no idea what data monetization is. Interestingly, most of them (62%) are aware that Big Data will provide new information from a mix of sources.
In sum, managers are ready to purchase Big Data services if data collection is anonymized (in health) and if data analysis and treatment are secure. This could be a key success factor for a Big Data in science activity-based company.
6.0 What will be the business model and added value of a Big Data services-based activity?
According to Wang (2012), there are three main business model approaches for Big Data. The first focuses on using data to create differentiated offerings (information-based differentiation, e.g. the Google AdSense advertising system); the second uses brokering that augments the value of information (information-based brokering, e.g. Bloomberg delivering additional analytical insights); the third involves content and information providers and brokers who create delivery networks enabling the monetization of data (information-based delivery networks). Data monetization is considered the fourth of the five steps of the "Big Data Business Model Maturity" chart proposed by Schmarzo (2012). Organizations try to sell their data with analytics to other organizations, create "intelligent" products, or transform their customer relationships by leveraging actionable insights (Schmarzo, 2013).
In the Life Sciences, scientists and researchers, as well as biotech and pharma managers, are quite aware of the amount of data generated by biology, but not of the tools available to handle and manage these data. They are accustomed to using and handling omics data, but they have little idea of what kind of relevant information they can expect from Big Data services.
First of all, a Big Data-based activity should demonstrate potential applications (Table 4) and benefits for customer segments. Managers of biotechnology and pharma companies expect actionable information that could help them decide, for instance, whether to continue developing a therapeutic molecule through long and expensive clinical trials. In the health domain, big pharma managers and most governments are probably overestimating the benefits of Big Data in reducing public health costs. Big Data services will have to provide specific, relevant, actionable and confidential information for Life Sciences managers (Fig. 3).
Sequencing the human genome took about 10 years (Dulbecco, 1986) (declared complete in April 2003); nowadays it takes about one week using NGS (Next Generation Sequencing), a 10,000-fold improvement in sequencing costs where Moore's Law would have predicted roughly a 100-fold improvement (Delort, 2012). However, it will take decades to extract all the valuable and actionable information from the genome (e.g. links between genes and diseases, or between non-coding sequences and diseases).
According to the Gartner hype cycle for 2014 (Rivera, 2014), among emerging IT technologies Big Data is already on the downslope of the peak of inflated expectations and will soon reach the trough of disillusionment. Surprisingly, data science is on the upslope toward the peak of expectations. This means, first, that Big Data has so far mainly been a buzzword with little substance, and second, that the business model of a Big Data-based activity will require human expertise to produce business intelligence.
However, the compound annual growth rate for Big Data technology and services is expected to be about 26.24%, with the market reaching $41.52 billion in 2018 (IDC, 2014).
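As a quick plausibility check, those two figures imply a base-year market size of roughly $13 billion. The arithmetic below assumes five compounding years (2013-2018), which is not stated in the source:

```python
# Back out the implied base-year market size from the IDC projection.
# Assumption: five compounding years (2013-2018); not stated in the source.
cagr = 0.2624           # compound annual growth rate (26.24%)
target_2018 = 41.52     # projected market size in billions of dollars
years = 5

implied_base = target_2018 / (1 + cagr) ** years
print(round(implied_base, 2))  # roughly 12.95, i.e. about $13 billion
```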
Moreover, Forbes reported a $125 billion Big Data analytics market for 2015 (Press, 2014). However, no market share data are available for Big Data in the Life Sciences. Managers of biotechnology and pharma companies, as well as research laboratory directors, are looking for information more actionable than what they are used to getting from omics data studies. Although new computing technologies (e.g. parallel computing) and cloud computing will allow additional data acquisition and treatment, data scientists with dual competencies will be needed to refine data information processes and results. Secure, relevant, appropriate and valued information generated by Big Data technologies and services will be the guarantee of a sustainable business model.
The Big Data area where expectations are greatest is health, because of the public health issues and challenges facing most countries. However, Big Data in health is constrained by the lack of data management and governance in pharma companies and by the personal data protection issue.
Nevertheless, given the large amount of data available in the Life Sciences, both structured and unstructured, and Big Data technologies combined with data science expertise (bioinformaticians), there is no doubt that a Big Data-based activity is, first, sustainable and, second, able to produce valuable information helping biotechnology and pharma companies to improve or accelerate their development.
In the end, Big Data in the Life Sciences will benefit from the field's established practice with structured data, but will only develop if Big Data services fulfil their promise as new discovery tools for researchers.
We would like to acknowledge Gilbert Blanchard, Director of CBB Capbiotek, for his help and support during the study. Special thanks to Lilybelle Malavé for her support.
"Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"
(T.S. Eliot, The Rock, 1934)
September 10, 2015 | Corresponding Author Guy Mordret, Ph.D | doi: 10.14229/jadc.2015.10.10.001
Received: July 29, 2015 | Published online September 10, 2015
Disclosures: Guy Mordret, Ph.D is an employee of Anaximandre Ltd., Parc d’Innovation de Mescoat, 29800 Landerneau, France.