Jesse English & Sarah Kelley
Carney Labs, LLC
April 22, 2015
Processing big data!
Finding trends and patterns!
Wowing the board with snazzy infographics!
Data science seeks to:
Which of these are data scientists?
Those examples were missing a critical component:
"A data scientist hypothesizes what they think is important and then uses algorithms to test their hypotheses."
Let's break that mess down into four core disciplines.
*note: there will be some 'cross-contamination'
Focus on software and practical solutions.
Focus on picking through and understanding the data.
Focus on investigation and hypothesis, plus model building.
Focus on actionable data (in and out) and big picture solutions.
An ideal data scientist has a skill set that crosses at least 3 of the disciplines.
*YMMV in any given environment.
Now how do you use those skills?
Don't reinvent the wheel ... unless you're also reinventing the road.
|Task|Tools|
|---|---|
|data collection, graph algorithms|PostgreSQL, MongoDB, Neo4j|
|natural language processing|NLTK, Stanford CoreNLP|
|MapReduce, distributed computing|Hadoop, AWS|
|machine learning, clustering, categorization|scikit-learn, Weka Toolkit|
|data visualization|R, Highcharts|
|knowledge engineering|Protégé (OWL)|
So you've mastered a few skills from across the four disciplines...
You've familiarized yourself with suites of tools to help you out...
What more do you need?
What does this graphic show?
* image courtesy of Anony-Mousse@stack-exchange
You have to know what the data you're dealing with is all about.
Don't just start clustering or detecting relationships - that becomes meaningless.
A good data scientist has domain expertise in the area they are data-sciencing. (e.g. linguistics for NLP)
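The warning about blind clustering can be made concrete with a small sketch. The data below is an evenly spaced range with no group structure at all, yet a bare-bones one-dimensional k-means still reports three tidy "clusters". The function `kmeans_1d`, the data, and the starting centroids are all made up for illustration; this is not taken from the pictured graphic.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    move each centroid to the mean of its members, repeat until stable."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = sum(members) / len(members)
    return centroids, labels


# Hypothetical data: evenly spaced values, i.e. no real clusters to find.
points = list(range(90))
centroids, labels = kmeans_1d(points, centroids=[0, 45, 89])
cluster_sizes = [labels.count(i) for i in range(3)]
```

The algorithm returns three groups because it was asked for three, not because the data contains them. Deciding whether a grouping means anything takes knowledge of the domain.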
Better than domain expertise is domain adaptability.
Adaptable data scientists can quickly get up to speed in an arbitrary domain.
They can serve as the middleman between the data team and full domain experts.
- Online Retailer Data Team
- Academic Research Data Team
- Predictive Analytics Data Team
- Research and Development Data Team
Very large volumes of inventory data and user view / purchase logs.
Collaborative filtering + light knowledge analytics.
Tracking trends, applying good business sense.
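As a rough sketch of what collaborative filtering over purchase logs might look like, the example below counts how often items are bought by the same user and recommends the items that most often co-occur with what a user already owns. The purchase data and the function names `build_cooccurrence` and `recommend` are hypothetical; a production recommender would be considerably more involved.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items shows up in the same user's log."""
    co = defaultdict(Counter)
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(user, baskets, co, top_n=3):
    """Score items the user does not own by co-occurrence with owned items."""
    owned = set(baskets[user])
    scores = Counter()
    for item in owned:
        for other, count in co[item].items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase logs keyed by user.
purchases = {
    "alice": ["tent", "stove", "lantern"],
    "bob": ["tent", "stove"],
    "carol": ["tent", "lantern", "boots"],
    "dave": ["stove"],
}
co = build_cooccurrence(purchases)
suggestions = recommend("dave", purchases, co)
```

Here `suggestions` ranks "tent" first for dave, since two other stove buyers also bought a tent.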
Wide variety of domains.
Pure research drive; several steps before any commercialization.
Identify trends in large-volume data sets.
Seeks to produce models that adapt accurately to changes in trends.
Broad domain applicability, but commonly found in business settings.
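One very simple way a model can adapt to a changing trend is exponential smoothing, where recent observations outweigh older ones. The sketch below is only an illustration of that idea, not a method named in the talk; the function `ewma`, the `alpha` value, and the sales figures are assumptions.

```python
def ewma(values, alpha=0.5):
    """Exponentially weighted moving average.

    A higher alpha reacts faster to new data; a lower alpha smooths more.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    level = None
    for v in values:
        # The first observation seeds the level; later ones blend in.
        level = v if level is None else alpha * v + (1 - alpha) * level
        smoothed.append(level)
    return smoothed


# Hypothetical daily sales that jump from 10 to 20 partway through.
daily_sales = [10, 10, 10, 20, 20, 20]
trend = ewma(daily_sales, alpha=0.5)
```

After the jump, `trend` climbs toward the new level step by step instead of staying anchored to the old one.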
Create new experiences based on identified user / customer desires.
Pre-commercialization, but would work with a business team towards the end of the lifecycle.
"Big" Data is a nebulous and silly term; it really just means "more data than yesterday".
Having Big Data presents certain practical problems (e.g., processing runtime).
Blindly pursuing Big Data likely signals a failure of domain understanding.
"Enough" Data is not a hot-topic buzzword.
It does, however, represent what you actually want.
How much data is "enough" will vary between domains and applications.
How many bazilliabytes of data do you need?
Side-channel data is data that can be captured or generated as a result of normal use / processing.
Very useful in almost every application.
It generally can't be recovered after the fact, so capture it even if you think you don't need it.
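A minimal sketch of capturing side-channel data, assuming a toy product search as the "normal use": the wrapper records who searched, for what, when, how long it took, and how many results came back. The names `search`, `logged_search`, and `side_channel_log` are hypothetical, and a real system would write to durable storage instead of an in-memory list.

```python
import time

side_channel_log = []  # stand-in for durable storage (assumption)


def search(catalog, query):
    """The normal-use operation: a naive case-insensitive substring search."""
    return [name for name in catalog if query.lower() in name.lower()]


def logged_search(catalog, query, user_id):
    """Run the search and record side-channel data about the call."""
    started = time.time()
    results = search(catalog, query)
    side_channel_log.append({
        "user_id": user_id,
        "query": query,
        "timestamp": started,
        "elapsed_s": time.time() - started,
        "result_count": len(results),
    })
    return results


catalog = ["Trail Tent", "Camp Stove", "Tent Stakes"]
hits = logged_search(catalog, "tent", user_id="u42")
```

None of the logged fields are needed to answer the search itself, but once the moment has passed they cannot be reconstructed.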
Take your small ("enough") data and expand it by pre-calculating use-cases.
Apply markup, metadata, etc. (side-channel data is useful here).
When possible, pre-calculate: disk space is cheap, losing users to slow response times is not.
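The pre-calculation idea can be sketched as trading storage for response time: aggregate once, ahead of time, so that each request is a dictionary lookup. The sales records and the names `precompute_top_sellers`, `TOP_SELLERS`, and `top_sellers` are made up for illustration.

```python
from collections import Counter, defaultdict


def precompute_top_sellers(sales, top_n=2):
    """Aggregate (category, item) sales records once, ahead of any request."""
    counts = defaultdict(Counter)
    for category, item in sales:
        counts[category][item] += 1
    return {
        category: [item for item, _ in counter.most_common(top_n)]
        for category, counter in counts.items()
    }


# Hypothetical sales records.
sales = [
    ("camping", "tent"), ("camping", "stove"), ("camping", "tent"),
    ("footwear", "boots"), ("camping", "lantern"), ("camping", "stove"),
    ("camping", "tent"),
]

# Computed once (e.g. in a nightly job) and stored.
TOP_SELLERS = precompute_top_sellers(sales)


def top_sellers(category):
    """Request-time path: a lookup, with no scan over the raw sales data."""
    return TOP_SELLERS.get(category, [])
```

The cost is that the stored table must be refreshed as new sales arrive; how often depends on how stale an answer the application can tolerate.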
As varied as anyone's definition of data science.
Almost certainly involves programming (at least scripting, or database work).
Be prepared to interact with live / changing data sets.
Produce actionable "models" (pattern recognizers, predictors, etc.).
"You're sailing the ship!" - you are the go-to person for anything with the word data near it.
Be prepared to be asked questions well outside your domain or scope of duties:
(domain expertise) + (algorithms) + (corpora) = data science;
(computer science) + (adaptability) + (databases) = data science;
But of course, the most important part is what you do with the data.