If you’re thinking of attending your first data science workshop, you’ve probably already figured out what data science is. Chances are you’ve already written some code in R or Python, know that big data is changing the world, and believe that data science is still the sexiest job in America.
In addition to understanding the importance of data science, do you know the difference between supervised and unsupervised learning? Do you know what functions are or how to create visualizations? These are some terms that you should be familiar with before heading into a Thinkful workshop. As a current student and TA in Thinkful’s data science program, I put together this list to help you better prepare to ask questions and understand the information you’ll encounter in one of our workshops.
Variables are reserved memory locations that store values like integers, decimals, or alphanumeric characters (called strings). You can think of a variable as a nickname being given to something so you can easily refer back to it.
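Here is a minimal sketch of two short examples, one without a variable and one with. The name “b” and the number 1.678090989 come from this section; the name “doubled” and the multiplication are assumptions added for illustration.

```python
# Example 1: without a variable, the full number is typed out in the calculation.
print(1.678090989 * 2)

# Example 2: the = sign stores the number under the nickname b,
# so later calculations can use the short name instead.
b = 1.678090989
doubled = b * 2  # "doubled" is a hypothetical name for this sketch
print(doubled)
```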
The = sign assigns a value to a variable. Once 1.678090989 has been assigned to a variable named “b”, you can simply type the letter “b” instead of 1.678090989 in future calculations.
Functions combine many instructions into a single line of code. Python, a language we discuss frequently in our workshops, has many built-in functions, including print(), which tells your computer to print whatever is within its parentheses, and abs(), which returns the absolute value of a number.
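As a quick sketch, the two built-in functions mentioned above can be tried like this. The message text, the number -7.5, and the variable name “distance” are assumptions chosen for illustration.

```python
# print() displays whatever is inside its parentheses.
print("Hello, workshop!")

# abs() returns the absolute value of a number.
distance = abs(-7.5)  # "distance" is a hypothetical variable name
print(distance)  # 7.5
```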
It’s also possible to create your own function. To do so, start by typing the keyword def, which lets Python know you’re about to define a function. Next, give the function a name, and then list any parameters in parentheses, followed by a colon. Parameters are the data you pass into the function, and they are optional. Inside the function, on indented lines, write out the set of instructions you’d like the function to perform. You can then call (use) the function later by simply typing the function name and any parameters.
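The steps above could look something like the sketch below. The function name “area” and its parameters “width” and “height” are made up for this example.

```python
# def tells Python a function definition is starting.
# "area" is the function name; width and height are its parameters.
def area(width, height):
    # The indented lines are the instructions the function performs.
    return width * height

# Call (use) the function by typing its name and passing in values.
result = area(3, 4)
print(result)  # 12
```

Once defined, the function can be called as many times as you like with different values.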
Many programming languages make use of arrays, which are used to store multiple values in one variable. However, arrays aren’t common in Python. Therefore, when people refer to arrays in a data science context, they are most likely talking about a list – a data type commonly used in Python. Dictionaries, sets, and tuples are three other data types used in Python.
In Python, a list is an ordered collection of values (of any type). Lists are mutable (which means they can be changed), their values are accessed by their index, and they’re written with square brackets. An index is the numeric position of an item within the list (first, second, last). Python is zero-indexed, so the first index is actually 0.
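A minimal sketch of such a list, using the name ‘MyList’ and the three values discussed in this section. The variable “second_item” and the final reassignment are assumptions added to show indexing and mutability.

```python
# A list holding an integer, a float, and a string.
MyList = [1, 2.0, "three"]

# Indexes start at 0, so index 1 is the second item.
second_item = MyList[1]
print(second_item)  # 2.0

# Lists are mutable: an item can be replaced in place.
MyList[0] = "one"
print(MyList)  # ['one', 2.0, 'three']
```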
For example, say I create a list called ‘MyList’ that contains the integer 1, the float 2.0, and the string “three”. If I then use the print() function to print the second item in ‘MyList’ (the item at index 1), I get 2.0. Output is what actually shows up on your screen.
A dictionary is a mutable collection of values that is indexed by key rather than by position (since Python 3.7, a dictionary also remembers the order in which its items were added). Dictionaries are written with curly brackets and consist of key-value pairs. Instead of referring to an index, you access values in the dictionary by referring to the associated key (a named location).
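A minimal sketch of a dictionary, using the name “MyDictionary”, the keys Name, Title, and Age, and the value “Wizard” from this section. The values “Merlin” and 150 are invented placeholders, as is the variable name “title”.

```python
# Keys on the left of each colon, values on the right.
MyDictionary = {
    "Name": "Merlin",   # placeholder value, not from the original article
    "Title": "Wizard",
    "Age": 150,         # placeholder value, not from the original article
}

# Look up a value by its key instead of by a numeric index.
title = MyDictionary["Title"]
print(title)  # Wizard
```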
As an example, say I create a dictionary called “MyDictionary” with three keys (Name, Title, and Age) and give each key a value (key-value pairs). If I then print the value stored under the Title key, the output is “Wizard”.
Packages and Modules
A package (sometimes referred to as a library) is a collection of modules (files) containing functions that can be used by other programs. A program is the full set of code utilized for a project. Typically, a package’s modules are all related. For example, Python’s built-in math module bundles many different mathematical functions, such as power (exponentials) and logarithms, among other helpful mathematical tools.
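As a small sketch, importing math and calling a few of its functions could look like this. The variable names and input numbers are chosen for illustration only.

```python
# import makes the module's functions available under the name "math".
import math

cubed = math.pow(2, 3)        # 2 raised to the power 3
log_value = math.log10(100)   # base-10 logarithm of 100
root = math.sqrt(16)          # square root of 16

print(cubed)      # 8.0
print(log_value)  # 2.0
print(root)       # 4.0
```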
The Data Science Toolkit
Data scientists use a set of tools often referred to as ‘the data science toolkit’. The toolkit is a combination of database software and Python libraries that a data scientist utilizes frequently. Most databases used in data science are queried with some variation of Structured Query Language (SQL). The packages typically utilized are NumPy, pandas, Matplotlib / Seaborn, and scikit-learn. These packages allow for mathematical manipulation, dataframes, plotting, and machine learning, respectively.
While these packages are used most frequently, it’s easy and often necessary to import a different package. For example, for data visualization, Bokeh can be used instead of Seaborn.
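If you’d like to see which of these packages are already on your computer before a workshop, a rough sketch like the one below can help. It uses only Python’s standard library, so it runs even if none of the toolkit is installed. The dictionary name “toolkit” and the helper “installed” are hypothetical names for this example.

```python
from importlib.util import find_spec

# Display name -> the name used in an import statement.
# Common import conventions: import numpy as np, import pandas as pd,
# import matplotlib.pyplot as plt, import seaborn as sns.
toolkit = {
    "NumPy": "numpy",
    "pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "Bokeh": "bokeh",
}

def installed(import_name):
    """Return True if Python can find a package with this import name."""
    return find_spec(import_name) is not None

# Check each package without actually importing it.
status = {label: installed(name) for label, name in toolkit.items()}

for label, found in status.items():
    print(label, "is installed" if found else "is not installed")
```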
About the Author
Ashley Simpson is a data science workshop instructor in Chicago and studied with Thinkful. Ashley is passionate about bridging the gap for non-techies to the technical world, and her main interest is diversity in AI to help create better models and outcomes.