In the past couple of weeks, from the CLO breakfast meeting interactions to the thousand-odd conversations (who's counting now?) at DevLearn 2013 Las Vegas, one topic that kept popping up was Big Data. So, needless to say, it is a trend, and a "hot" one indeed. However, I should add that, as with any new trend, there is a lot of confusion, scrambling to understand and be in-the-know, misconceptions, apprehensions... you name it, and you could sense it. I am reminded of another trend not very long ago called "cloud computing". The most hilarious, and I daresay the most extreme, misconception was a politician in a foreign country rambling on about cloud computing and how advanced technology is these days that we now have computers up in the clouds! I promise this is not made up; look it up on YouTube. I have yet to see such outrageous misconceptions about Big Data, but there are nevertheless plenty of misconceptions, however genuine they might be.
So, I thought it would be a good idea to write a blog post to clear up some doubts and also highlight some of the huge benefits and potential that Big Data and Big Data Analytics could bring to the L&D org and the CLO organization as a whole. Through this post, I would also like to throw some challenges to the L&D leaders out there: come up with compelling use-cases and problems to be solved by Big Data professionals like me.
First of all, let me state the obvious. Big Data is a set of technologies that allows massively large data (big data) to be processed and analyzed to gain new insights in business. It is a new paradigm and technology stack that promises to transform our IT and technology, empowering business leaders with great insights and new analyses hitherto impossible or difficult to obtain.
There are plenty of statistics out there that justify the term Big Data. The amount of all digital data in the world today is about 2.7 zettabytes (now go Google what a zettabyte is :) ). 90% of all the digital data in the world was created in the last 2 years. That should give you some idea of the growth of this data; if I may use a layman's mathematical term, it is "beyond" exponential growth! Consider some of these numbers:
- Facebook hosts over 45 billion photos
- Over 100 terabytes of data are uploaded to Facebook every day
- Brands and organizations on Facebook receive about 34,000 likes every minute
- There are over 230 million tweets every day
- Walmart handles over 1 million transactions every hour
- There were over 4.8 trillion ad impressions in 2011
- An estimated 570 new websites are created every minute
- 48 hours of video are uploaded to YouTube every minute
If this massive amount of data has not already made your head spin, think of how much there will be in the next couple of years. Now imagine all the mundane sources of data in your own domains and organizations. Here are a few examples of data sources that you may never have thought of as data that could be mined for intelligence:
- The badge scans of employees entering and exiting every building in your organization’s campus
- Server logs that provide clues to user behavior on your application / website
- Access timestamps of your courses from your LMS
- Click timestamps on each slide / page of a course, providing vital information such as time taken to complete (a short sketch after this list shows how such timestamps can be mined)
- Click history and timestamps of users' behavior on practice exercises within an e-learning course
- In an in-class training session, there are umpteen data points out there, like the time spent on each slide, the number of times slides were re-presented, the amount of time a quiz slide was up, and so on
- The number of tweets generated during an in-class training session related to the training, plus the follow-ups / retweets
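As a concrete illustration of how mundane a useful data source can be, here is a minimal Python sketch that turns raw LMS click timestamps into average time spent per slide. The event format and field values are hypothetical; a real LMS export would need its own parsing, and at true Big Data scale this computation would run on a cluster rather than in a single script.

```python
from datetime import datetime

# Hypothetical clickstream rows exported from an LMS:
# (user_id, slide_id, ISO-8601 timestamp of the click)
clicks = [
    ("u1", "slide-1", "2013-11-01T09:00:00"),
    ("u1", "slide-2", "2013-11-01T09:02:30"),
    ("u1", "slide-3", "2013-11-01T09:03:10"),
    ("u2", "slide-1", "2013-11-01T10:00:00"),
    ("u2", "slide-2", "2013-11-01T10:04:00"),
]

def average_time_per_slide(clicks):
    """Average seconds spent on each slide, from consecutive clicks per user."""
    by_user = {}
    for user, slide, ts in clicks:
        by_user.setdefault(user, []).append((slide, datetime.fromisoformat(ts)))

    durations = {}  # slide_id -> list of dwell times in seconds
    for events in by_user.values():
        events.sort(key=lambda e: e[1])
        # Time on a slide = gap until the next click; the final slide
        # a user viewed has no following click, so it is not counted.
        for (slide, t1), (_, t2) in zip(events, events[1:]):
            durations.setdefault(slide, []).append((t2 - t1).total_seconds())

    return {slide: sum(secs) / len(secs) for slide, secs in durations.items()}

print(average_time_per_slide(clicks))
# {'slide-1': 195.0, 'slide-2': 40.0}
```

Humble as it looks, this is exactly the kind of routinely discarded signal that, captured at scale, tells you which slides learners struggle with.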
One thing these examples have in common is that they are all streams of data: data is produced continuously and is usually lost because it is not captured and mined. With Big Data, however, we now have the tools, techniques, and capabilities to capture, analyze, and mine these sources of data, combine them with other data, and thereby add tremendous value. Let me offer one more piece of data that might calm your nerves: the Human Genome Project took about a decade, but with today's technology it could be done in a week. So we can say, "We've got the POWER!!!"
Let me give you an example of a use-case where Big Data and Big Data Analytics are used, which should give a very good sense of how sources of data such as the ones mentioned above can be harnessed and mined for intelligence. The Department of Homeland Security and the NSA routinely process huge amounts of data from all types of chatter around the world, with special emphasis on terrorist cells / sleeper cells and other potentially dangerous networks, and analyze it to find patterns of conversation, trends in keywords, connections between people, and so on, to identify potential security threats and generate "intel" (intelligence) that authorities can act on. One such example was recently seen in the nabbing of bin Laden and various other leaders of terror networks around the world.
So let us look a little deeper into the example above to understand the traits that can provide some insight into the types of things possible in the L&D domain, and also to clarify what Big Data and Big Data Analytics do and what distinguishes them from traditional analysis, Business Intelligence, and traditional databases / data (even when big in size).
The primary sources of data in the above example were social media chatter combined with several different data sources: lists of persons of interest, financial records and international fund movements, and several other databases from the FBI, CIA, and other law-enforcement agencies. Much of the data being brought in is streaming data from social media, and constant analysis is done to mine it, discover trends, and surface predictive patterns that can be highlighted. You can also imagine the types of analysis techniques that went into it: statistical techniques like regression analysis, genetic algorithms and the other mathematical optimization techniques I learnt in my engineering days, Artificial Intelligence, machine learning, neural networks... you name it, and I can see it being used. Architecturally, a Big Data stack is typically organized in layers:
- At the top is the visualization layer, which provides the means for the user to slice and dice data, visualize trends, and search for and discover patterns.
- Below it is the analytics engine layer, which houses the techniques above and drives / powers the analysis.
- Below the analytics layer is the processing layer, where massive amounts of data are retrieved from the databases into memory and crunched with huge processing power; in-memory databases are usually used here.
- Below this is the data layer, where data, including data brought in from traditional databases, is modeled in a way that is flexible enough to be sliced and diced in different ways to mine for patterns.
- Below that is the ETL (Extract, Transform, Load) layer, which is essentially the tools used to get data from several sources into the stack for analysis.
- Underneath it all are the infrastructure pieces: the databases and storage that hold these massive amounts of data.
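To make the streaming-analysis idea a bit more concrete, here is a toy Python sketch of the kind of thing the analytics engine layer does: a sliding five-minute window over a stream of messages that flags a watched keyword when it spikes. The keywords, threshold, and messages are invented purely for illustration; production systems distribute this work across clusters and use far more sophisticated models, but the sliding-window core is the same.

```python
from collections import Counter, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

class TrendDetector:
    """Count watched keywords over a sliding time window and flag spikes."""

    def __init__(self, keywords, threshold=3):
        self.keywords = set(keywords)
        self.threshold = threshold  # hits per window before we call it a trend
        self.events = deque()       # (timestamp, keyword) hits, oldest first
        self.counts = Counter()

    def feed(self, ts, message):
        # Expire hits that have slid out of the five-minute window.
        while self.events and ts - self.events[0][0] > WINDOW:
            _, old = self.events.popleft()
            self.counts[old] -= 1
        # Record any watched keywords in the incoming message.
        for word in message.lower().split():
            if word in self.keywords:
                self.events.append((ts, word))
                self.counts[word] += 1
                if self.counts[word] == self.threshold:
                    print(f"{ts:%H:%M}: '{word}' is trending "
                          f"({self.counts[word]} hits in the last 5 minutes)")

# Feed a hypothetical message stream through the detector.
detector = TrendDetector(["transfer", "meeting"])
t0 = datetime(2013, 11, 1, 9, 0)
for offset, msg in [(0, "wire transfer done"), (1, "transfer confirmed"),
                    (2, "schedule the transfer"), (9, "set up a meeting")]:
    detector.feed(t0 + timedelta(minutes=offset), msg)
# -> 09:02: 'transfer' is trending (3 hits in the last 5 minutes)
```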
So, enough of the nerdy tech talk and on to plain-English layman's talk. Let us look at the Big Data implementation / infrastructure characteristics that distinguish it from a traditional large database and business intelligence. The following will serve as a litmus test to identify truly Big Data projects that can help the L&D and CLO organization:
- A lot of streaming data from mundane data sources.
- User behavior data
- Unstructured data without any pre-determined structure (no pre-defined columns / fields)
- Real-time analytics (analysis done in real time, as opposed to, say, creating a survey and collecting the data later to gather insight)
- Of course massive amounts of data (Big Data)
- Sophisticated analysis techniques involving some or all of the following (a small sketch of the simplest of these follows the list):
  - Statistical analysis techniques like regression analysis
  - Artificial Intelligence / expert systems
  - Machine learning
  - Neural networks, genetic algorithms, or other optimization algorithms
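As a taste of the simplest technique on that list, here is a minimal regression sketch in Python (using numpy). The numbers are made up: imagine minutes spent in an e-learning course versus each learner's assessment score, with an ordinary least-squares line then used to predict the score for a new learner.

```python
import numpy as np

# Hypothetical learner data: minutes spent in a course vs. assessment score.
minutes = np.array([12, 25, 33, 41, 58, 62, 75])
scores = np.array([55, 61, 70, 72, 80, 84, 88])

# Ordinary least-squares fit: score ~= slope * minutes + intercept.
slope, intercept = np.polyfit(minutes, scores, deg=1)
print(f"score ~= {slope:.2f} * minutes + {intercept:.1f}")

# Predict the expected score for a learner who spends 50 minutes.
print("predicted score at 50 minutes:", round(slope * 50 + intercept, 1))
```

Real Big Data analytics runs models like this (and far fancier ones) continuously over streams rather than once over seven points, but the underlying math is the same.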
Big Data practitioners often use these four "V"s to describe its characteristics:
- Volume
- Variety
- Velocity
- Value
The fourth "V", Value, is the most important, and it is the one that L&D leaders and CLOs can leverage to ride this Big Data trend / technology.
I hope I have done some de-mystifying of this hot trend for the benefit of L&D leaders and CLOs, and also motivated them to think about the enormous value they can derive from leveraging Big Data. I welcome questions and comments on this blog and will be glad to clarify and provide more resources.
Please reach out to me at Sitarama@ideaoninc.com or to any of our staff at Ideaon Inc., and we can help you with your ideas for leveraging Big Data, all the way from design to implementation, and with training your people in these areas.
Thanks,
Shankaran Sitarama
CTO and Director – Technical
Ideaon Inc