What the !@#% is Data Science?

Jesse English & Sarah Kelley
Carney Labs, LLC
April 22, 2015

About Me

  • CS PhD from UMBC in 2010 (ML and NLP)
  • Postdoc and current research in semantic reasoning and knowledge graphs
  • CSO at Unbound Concepts
  • Data Science Consulting
  • Data Team Lead at Carney Labs, LLC

Overview

  • What exactly is data science?
  • What does it take to be a data scientist?
  • Why do domains matter?
  • What is a data team?
  • What is "Big" Data?
  • I got a job! Now what?

What isn't data science?

Some common misconceptions

Processing big data!

Finding trends and patterns!

Wowing the board with snazzy infographics!

What is data science?

Data science seeks to:

  1. elicit meaning or value from data and / or
  2. use data to answer a question, or solve a problem

What is a data scientist?

Which of these are data scientists?

  • A JavaScript expert who enjoys making really cool interactive graphs and visualizations.
  • A programmer with distributed computing experience and related technical skills (e.g., Java & Hadoop).
  • A business intelligence analyst who makes reports based
    on what they think is important in the data.

What is a data scientist?

Those examples were missing a critical component:

"A data scientist hypothesizes what they think is important and then uses algorithms to test their hypotheses."

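As a rough illustration of "hypothesize, then test with an algorithm", here is a minimal sketch of a permutation test. The scenario (two page layouts and minutes spent on site) and every name in it are hypothetical, and the permutation test is just one of many ways to check a hypothesis.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: if the groups do not really differ,
        # a random split should often look as extreme as the real one.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical data: minutes on site for two page layouts.
layout_a = [12.1, 11.8, 12.5, 13.0, 12.2, 11.9]
layout_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]
p_value = permutation_test(layout_a, layout_b)
```

A small p-value suggests the difference between layouts is unlikely to be chance alone; deciding whether that difference matters is still a domain question.
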
What skills does a data scientist need?

statistics, data visualization, graph theory, business acumen, bayesian algorithms, hidden markov models,
data collection, data organization, data ingestion, data manipulation, data collation, data normalization,
natural language processing, natural language understanding, machine learning, categorization, clustering,
analysis, MapReduce, graph algorithms, knowledge acquisition, knowledge engineering, cs algorithms,
efficiency algorithms, distributed computing, pattern recognition

Data Disciplines

Let's break that mess down into four core disciplines.

*note: there will be some 'cross-contamination'

- Data Engineers -

Focus on software and practical solutions.

  • MapReduce
  • data collection
  • data normalization
  • natural language processing
  • graph algorithms
  • cs algorithms
  • efficiency algorithms
  • distributed computing

- Data Analysts -

Focus on picking through and understanding the data.

  • statistics
  • bayesian algorithms
  • categorization
  • clustering
  • pattern recognition
  • business acumen

- Data Researchers -

Focus on investigation and hypothesis, plus model building.

  • data collection
  • natural language understanding
  • machine learning
  • graph algorithms
  • knowledge acquisition

- Data Creatives -

Focus on actionable data (in and out) and big picture solutions.

  • data visualization
  • hidden markov models
  • data normalization
  • natural language understanding
  • graph theory
  • knowledge engineering

Data Disciplines - Pick 3

An ideal data scientist has a skill set that crosses at least 3 of the disciplines.

*YMMV in any given environment.

Now how do you use those skills?

Practical Tools

Don't reinvent the wheel ... unless you're also reinventing the road.

Skill                                          Tool
data collection, graph algorithms              PostgreSQL, MongoDB, Neo4J
natural language processing                    NLTK, Stanford CoreNLP
MapReduce, distributed computing               Hadoop, AWS
machine learning, clustering, categorization   scikit-learn, Weka Toolkit
data visualization                             R, Highcharts
knowledge engineering                          Protege (OWL)

Hooray! I'm a Data Scientist... Right?

So you've mastered a few skills from across the four disciplines...

You've familiarized yourself with suites of tools to help you out...

What more do you need?

Numbers, Charts, Graphs

What does this graphic show?

* image courtesy of Anony-Mousse@stack-exchange

Domain Expertise

You have to know what the data you're dealing with is all about.

Don't just start clustering or detecting relationships - that becomes meaningless.

A good data scientist has domain expertise in the area they are data-sciencing (e.g., linguistics for NLP).

Domain Adaptability

Better than domain expertise is domain adaptability.

Adaptable data scientists can quickly get up to speed in an arbitrary domain.

They can serve as the middleman between the data team and full domain experts.

Data Teams

"Beware the Unicorn"

- Online Retailer Data Team

- Academic Research Data Team

- Predictive Analytics Data Team

- Research and Development Data Team

Online Retailer Data Team

Very large volumes of inventory data and user view / purchase logs.

Collaborative filtering + light knowledge analytics.

Tracking trends, applying good business sense.
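
To make the collaborative filtering mentioned above concrete, here is a minimal sketch of item-based filtering over a purchase log. The users, items, and function names are all hypothetical, and cosine similarity over co-purchases is only one possible similarity measure; a real retailer's pipeline would be considerably more involved.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase log: user -> set of purchased items.
purchases = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove"},
    "cat": {"stove", "lantern"},
    "dan": {"novel"},
}

def item_similarities(purchases):
    """Cosine similarity between items, based on which users bought them."""
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    sims = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a != b:
                overlap = len(buyers[a] & buyers[b])
                if overlap:
                    sims[a][b] = overlap / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims

def recommend(user, purchases, top_n=3):
    """Rank items the user does not own by similarity to items they do own."""
    sims = item_similarities(purchases)
    owned = purchases[user]
    scores = defaultdict(float)
    for item in owned:
        for other, score in sims[item].items():
            if other not in owned:
                scores[other] += score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

With this toy log, `recommend("bob", purchases)` suggests the lantern, while `recommend("dan", purchases)` returns nothing because no other user shares a purchase with dan.
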

Academic Research Data Team

Wide variety of domains.

Pure research drive; several steps before any commercialization.

Predictive Analytics Data Team

Identify trends in large-volume data sets.

Seek to produce models that adapt accurately to changes in trends.

Broad domain applicability, but commonly found in business settings.

Research and Development Data Team

Create new experiences based on identified user / customer desires.

Pre-commercialization, but would work with a business team towards the end of the lifecycle.

Data Science Life Cycle

Big Data

* and other data too

"Big" Data is a nebulous and silly term; it really just means "more data than yesterday".

Having Big Data presents certain practical problems (e.g., processing runtime).

Blindly pursuing Big Data likely indicates a failure to understand the domain.

Enough Data

"Enough" Data is not a hot-topic buzzword.

It does, however, represent what you actually want.

How much data is "enough" will vary between domains and applications.

How many bazilliabytes of data do you need?

  • Chemistry / physics ML application = BIG.
  • Social cause / effect predictor = SMALL.

Side-channel Data

Side-channel data is data that can be captured or generated as a result of normal use / processing.

Very useful in almost every application.

Generally can't be recovered, so capture it even if you think you don't need it.

  • Logging user activity, timeouts, etc.
  • Time to process, errors, warnings, and hardware activity levels for batch calculations.
  • Environmental variables "unrelated" to the task at hand (audio levels, lighting).
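
One way to capture this kind of data is to wrap batch functions so that timing and errors are recorded as a by-product of normal use. The sketch below assumes an in-memory list as the sink; `side_channel_log`, `capture_side_channel`, and `batch_total` are hypothetical names, and a real system would more likely write to a log file or database.

```python
import functools
import time

# Hypothetical in-memory sink for captured side-channel records.
side_channel_log = []

def capture_side_channel(func):
    """Record start time, duration, and any error for each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"task": func.__name__, "started_at": time.time(), "error": None}
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # capture the failure, but do not hide it from the caller
        finally:
            record["seconds"] = time.perf_counter() - start
            side_channel_log.append(record)
    return wrapper

@capture_side_channel
def batch_total(values):
    """Stand-in for a real batch calculation."""
    return sum(values)
```

Each call to `batch_total` appends one record to `side_channel_log`, whether the call succeeds or fails, so the data is there later even if nobody needs it today.
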

MapExpand

Take your small ("enough") data and expand it by pre-calculating use-cases.

Apply markup, metadata, etc. (side-channel data is useful here).

When possible, pre-calculate: disk space is cheap; losing users to slow response times is not.
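
A minimal sketch of one reading of this idea: take a small data set and pre-calculate the answers to anticipated queries, so that serving them is a dictionary lookup. The records, field names, and the `expand` function are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical "enough" data: (title, grade level, reading time in minutes).
records = [
    ("Book A", 3, 20),
    ("Book B", 3, 30),
    ("Book C", 4, 45),
]

def expand(records):
    """Pre-calculate per-grade summaries once, ahead of any user request."""
    by_grade = defaultdict(list)
    for title, grade, minutes in records:
        by_grade[grade].append((title, minutes))
    expanded = {}
    for grade, rows in by_grade.items():
        expanded[grade] = {
            "titles": sorted(title for title, _ in rows),
            "mean_minutes": mean(minutes for _, minutes in rows),
            "count": len(rows),
        }
    return expanded

# Computed once (e.g., in a nightly batch job); lookups are then constant time.
precomputed = expand(records)
```

The trade-off is the one stated above: the expanded table takes more storage than the raw records, but a request such as `precomputed[3]` no longer has to scan or aggregate anything.
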

Real World Expectations - Practical Stuff

As varied as anyone's definition of data science.

Almost certainly involves programming (at least scripting, or database work).

Be prepared to interact with live / changing data sets.

Produce actionable "models" (pattern recognizers, predictors, etc.).

Real World Expectations - Friendly Advice

"You're sailing the ship!" - you are the go-to person for anything with the word data near it.

Be prepared to be asked questions well outside your domain or scope of duties:

  • This is fine - it's also normal!
  • Be confident, but don't make things up.
  • If you need to pull in an expert, make that clear, do so, and learn from it.

Cynical Summary

(domain expertise) + (algorithms) + (corpora) = data science;
(computer science) + (adaptability) + (databases) = data science;

But of course, the most important part is what you do with the data.

Questions?