This article aims to define key terms commonly used in data science and artificial intelligence, so that beginners in these areas can quickly understand how they are linked, and the differences between them.
Data science is a field of study that involves collecting, cleaning, modeling (often using AI techniques), and presenting/visualizing structured or unstructured data, in order to extract useful knowledge and insight from it. Data alone is just a bunch of numbers or symbols, with no meaning to the human eye. Data science involves manipulating this raw data and transforming it into something human-readable (knowledge and information). So data science applies AI techniques to analyze and model data. Some related areas are:
- Data mining deals with discovering patterns in large (hence, big data), typically structured data sets, using machine learning, database systems and statistics.
- Information science deals with the analysis, manipulation, storage and retrieval of (usually unstructured) data, and transforming it into useful knowledge. Information retrieval in search engines is a popular problem area.
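To make information retrieval a little more concrete, here is a toy sketch of the core idea behind a search engine: ranking documents by how often they contain the query's terms. The documents, query, and the simple term-frequency scoring function are all invented for illustration; real search engines use far more sophisticated ranking.

```python
from collections import Counter

def score(query, document):
    """Score a document by how many times it contains the query's terms."""
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[term] for term in query.lower().split())

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
]

# Rank documents by relevance to the query and retrieve the best match.
query = "cat mat"
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranked[0])  # "the cat sat on the mat" (contains both query terms)
```

Notice that the second document, which mentions "cats" rather than "cat", scores zero here; handling such near-matches is exactly the kind of problem information retrieval research tackles.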
Artificial intelligence (AI), sometimes called machine intelligence, uses computer science to enable computers to perform tasks that normally require human intelligence, often by training them with large amounts of data. The aim is to reach a point where a computer can perform such a task as well as, or more efficiently than, a human. Because AI systems learn from data, AI also draws on data science techniques to analyze its input data. Some examples are Siri, facial recognition, self-driving cars, and fake news detection. Prominent branches of AI are (1) machine learning, (2) natural language processing, (3) robotics, and (4) computer vision. More on these below:
(1) Machine learning employs supervised or unsupervised models that automatically learn how to perform a certain task, or predict a certain outcome, from data, without being explicitly instructed how to do so.
- Supervised machine learning involves feeding a classifier model with labeled training data. This data comprises pairs of input data and their correct outputs. The goal is to have the classifier ‘learn’ how to predict the correct output for unseen inputs, using the given training data. An example is to give a classifier a set of positive Tweets and a set of negative Tweets, and then teach it to predict the sentiment polarity of new Tweets it has not seen before.
- Unsupervised machine learning uses little or no human supervision: instead of learning from labeled examples, a model discovers underlying patterns within the input data itself. Clustering is a good example; it groups similar data items together, so that items in a cluster are more similar to each other than to items in other clusters.
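To make the supervised/unsupervised distinction concrete, here is a toy sketch in plain Python. The tweet texts, the word-counting "classifier", and the tiny one-dimensional k-means are all invented for illustration; real systems would use a proper machine learning library.

```python
from collections import Counter

# --- Supervised: learn sentiment from labeled tweets -----------------------
labeled_tweets = [
    ("i love this phone, great battery", "positive"),
    ("what a wonderful day", "positive"),
    ("terrible service, i hate it", "negative"),
    ("awful movie, total waste", "negative"),
]

# "Training": count how often each word appears with each label.
word_label_counts = Counter()
for text, label in labeled_tweets:
    for word in text.lower().split():
        word_label_counts[(word, label)] += 1

def predict_sentiment(text):
    """Predict the label whose training words overlap most with the text."""
    scores = Counter()
    for word in text.lower().split():
        for label in ("positive", "negative"):
            scores[label] += word_label_counts[(word, label)]
    return scores.most_common(1)[0][0]

print(predict_sentiment("i love this wonderful battery"))  # positive

# --- Unsupervised: cluster unlabeled numbers by proximity (1-D k-means) ----
def kmeans_1d(points, k=2, iters=10):
    """Group numbers into k clusters around iteratively refined centroids."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

print(kmeans_1d([1.0, 1.2, 0.8, 9.9, 10.1, 10.0]))
# [[1.0, 1.2, 0.8], [9.9, 10.1, 10.0]]
```

The supervised half needs the labeled examples to learn from; the unsupervised half is never told what the groups are and finds the two clusters purely from the structure of the numbers.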
(2) Natural language processing deals with interactions between computers and humans, enabling computers to understand and analyze large amounts of human language (e.g. English). It further branches into natural language generation, natural language understanding and speech recognition, and it overlaps substantially with computational linguistics, text mining and text analytics.
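Most natural language processing pipelines start with tokenization: splitting raw text into word tokens that a program can count and analyze. A minimal sketch in plain Python (the example sentence and the regex-based tokenizer are just for illustration):

```python
import re
from collections import Counter

text = ("Natural language processing helps computers understand "
        "human language. Language is messy!")

# Tokenization: split the raw text into lowercase word tokens,
# discarding punctuation along the way.
tokens = re.findall(r"[a-z']+", text.lower())

# A first, crude analysis step: word frequencies.
freq = Counter(tokens)
print(freq.most_common(1))  # [('language', 3)]
```

Even this simple frequency count already hints at what the text is about; tasks like sentiment analysis or topic modeling build on exactly these kinds of token statistics.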
(3) Robotics involves the design, construction and operation of robots. These can be industrial robots, military robots, domestic household robots, medical robots, drones, etc.
(4) Computer vision deals with enabling computers to derive meaningful information from digital images and videos. The aim is to allow a computer to ‘see’ in a way similar to how humans do.
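At the lowest level, an image is just a grid of pixel brightness values, which is why computer vision is so data-hungry. A toy sketch of one of the simplest vision operations, thresholding, which separates bright regions from dark ones (the tiny hand-written "image" below is invented for illustration):

```python
# A tiny 4x4 grayscale "image": each value is a pixel brightness (0-255).
image = [
    [10, 12, 200, 210],
    [ 9, 14, 205, 220],
    [11, 13, 198, 215],
    [ 8, 15, 202, 225],
]

# Thresholding: classify each pixel as dark (0) or bright (1).
# This crude step is how a program first starts to "see" structure,
# e.g. separating an object from its background.
threshold = 128
binary = [[1 if px > threshold else 0 for px in row] for row in image]

for row in binary:
    print(row)  # left half dark (0), right half bright (1)
```

Real computer vision systems layer far more sophisticated operations (edge detection, convolutional neural networks) on top of pixel grids like this one.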
That’s all for now! You should now have a brief understanding about each of these areas, how they are similar, and how they are different. Thoughts? Please comment below! 🙂