Data Mining MCQs

These Data Mining multiple-choice questions and their answers will help you strengthen your grip on the subject of Data Mining. You can prepare for an upcoming exam or job interview with these 100+ Data Mining MCQs.
So scroll down and start answering.

1: Which industry can benefit from data mining?

A.   All of these

B.   Retail

C.   Manufacturing

D.   Finance/Banking

2: With which of these layers does a neural network start?

A.   Output Layer

B.   Hidden Layer

C.   Transparent layer

D.   Input layer

3: Changes to parts of a code could lead to the problem of ______________ data.

A.   inconsistent

B.   dirty

C.   nonintegrated

D.   granular

4: In a neural net, to what does topology refer?

A.   The range of variables in a set

B.   The number of nodes utilized

C.   The graphical visualization of the data

D.   The number of layers and the number of nodes in each layer

5: Which of the following clustering algorithms can find clusters of arbitrary shape?

A.   Single-Link

B.   DBSCAN

C.   Both of these

D.   None of these

6: Decision trees are able to handle missing values without using any impute transformation. True or False?

A.   False

B.   True

7: A(n) _____ algorithm creates rules that describe how often events have occurred together.

A.   CHAID

B.   artificial

C.   pruning

D.   associative

8: Which of the following is valid XML?

A.   <body answer="valid">This One</body>

B.   <valid>This One</valid>

C.   <valid>"This One"</valid>

D.   All are valid

9: Which of the following is not a relational database?

A.   All of the above

B.   Apache Cassandra

C.   Google Big Table

D.   MongoDB

10: What is data visualization?

A.   The technical term for the act of data being stored in a server

B.   A structured and developed prediction of data results

C.   The visual interpretation of complex relationships in multidimensional data

11: What is a KDD Process?

A.   Differential Decryption

B.   Knoop-hardness measured through high-impact dimension

C.   Knowledge Discovery in Databases

D.   K-mean Data Discovery

12: Which of these are NOT types of analytical software:

A.   All are valid types

B.   Neural network

C.   Statistical

D.   Machine learning

13: True or False? Economic indicators are external data factors.

A.   False

B.   True

14: Which of the following disciplines overlaps Data Mining?

A.   All of the above

B.   Artificial Intelligence

C.   Statistics

D.   Linguistics

15: In predictive models, the values or classes to be predicted are called the:

A.   Dependent

B.   All of these

C.   Response

D.   Target variables

16: You are a credit risk manager at a retail bank. Some information about customers is available for analytics. Based on this data, you have to decide whether a person will be a good or bad customer. Choose the appropriate data mining task for this business problem.

A.   Classification

B.   Regression

C.   Segmentation

17: Data items grouped into relationships and preferences are known as:

A.   Predictable Sets

B.   Functional Organizations

C.   Degrees of Fit

D.   Clusters

18: What are decision trees?

A.   Complex reports generated by a qualified data scientist

B.   Hierarchical dimensions that can be created with a hyper cube browser

C.   Data not collected by the organization, such as data available from a reference book

D.   Structures that generate rules for the classification of a dataset

A.   Relational Learning Models

B.   Decision Trees and Rules

C.   All of these

D.   Probabilistic Graphical Dependency Models

20: True or False? Loose-coupling data mining architecture is mainly for memory-based data mining systems that do not require high scalability and high performance.

A.   False

B.   True

21: What is CRISP-DM?

A.   A decision tree developed in the 1980s but almost entirely replaced by the CART method today

B.   A six phase method for predicting e-commerce buying habits

C.   Microsoft's linear regression algorithm

D.   A cross-industry standard process for data mining

22: A function used by a node in a neural net to transform input data from any domain of values into a finite range of values is known as a(n):

A.   Antecedent

B.   Activation Function

C.   Confusion matrix

D.   Chi-square

23: True or False? Tests in CART are always Binary.

A.   True

B.   False

24: What is the measure of how much two random variables change together?

A.   binary standard deviation

B.   covariance

C.   polyconvergence

D.   stochastic inertia
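
To make the quantity in question 24 concrete, here is a minimal sketch that measures how two small series vary together. The function name `sample_covariance` and the data are made up for illustration, and the sketch assumes the sample form with an n - 1 denominator.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data: the second series rises whenever the first does.
hours = [1, 2, 3, 4]
scores = [2, 4, 6, 8]
print(sample_covariance(hours, scores))
```

A positive result means the two series tend to move in the same direction; a negative result means they tend to move in opposite directions.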

25: Which of these is an example of a sequential pattern relationship?

A.   Using business experience and gut instinct to design a new floorplan in a grocery store

B.   Reorganizing your basketball team's starting lineup based on an analysis of performance

C.   Placing two frequently purchased items next to each other on the shelf

D.   Predicting the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes

26: The annual revenue of an international company is correlated with other attributes such as advertising, exchange rate, and inflation rate. Given these values (or reliable estimates of them for the next year), the company has to calculate its expected revenue for the next year. Choose the appropriate data mining task for this business problem.

A.   Segmentation

B.   Classification

C.   Regression

27: What is the front end layer of data mining architecture?

A.   An intuitive and user friendly user interface

B.   Firewalls established to protect data from malicious sources

C.   The hardware designed specifically for storage of massive amounts of data

D.   The team of programmers who designed the software utilized in a particular mining project

28: A hyperplane is a

A.   decision boundary separating classes of data

B.   variant of the C4.5 algorithm

C.   collection of linked hypertext files

D.   non-terminating error condition

29: Data not collected by the organization, such as data from a proprietary database, that is combined with the organization’s own data is known as:

A.   Overlay

B.   Overfitting

C.   Noise

D.   Non-applicable data

30: Which of these are NOT considered internal data factors?

A.   Price

B.   Economic downturns

C.   Staff Skills

D.   Product Positioning

31: Which data mining technique organizes sets of data into predefined groups?

A.   Sequential Patterning

B.   Clustering

C.   Classification

D.   Gamification

32: The level of the model that specifies (often graphically) which variables are locally dependent on each other.

A.   Structural Level

B.   Qualitative Level

C.   Primary Level

D.   Quantitative Level

33: To increase confidence in your estimate of classification performance on the entire population, you should:

A.   Decrease the size of the training dataset

B.   Increase the size of the training dataset

C.   Increase the size of the test dataset

D.   Decrease the size of the test dataset

34: The algorithm powering the Google search engine is:

A.   AdaBoost

B.   The Brin-Page Method

C.   GoogleCrawler

D.   PageRank

35: In the association between two variables, what is the difference between the antecedent and the consequent?

A.   The antecedent is always a very complex variable

B.   Nothing, they are interchangeable

C.   The antecedent is on the right, the consequent is on the left.

D.   The antecedent is on the left, the consequent on the right

36: In the analysis of time-series data, the mean value over a given time period (usually some interval in the past up to the present) is called a(n)

A.   partial average

B.   unbiased mean

C.   compounded mean

D.   moving average
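
The time-series mean described in question 36 can be sketched in a few lines. The helper name `moving_average` and the sales figures are hypothetical, and this version assumes a simple unweighted window.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

daily_sales = [10, 20, 30, 40, 50]     # made-up series
print(moving_average(daily_sales, 3))  # [20.0, 30.0, 40.0]
```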

37: What is Regression?

A.   Learning a function that maps a data item into one of several predefined groups.

B.   An expression E in a language L describing facts in a subset FE of F.

C.   A descriptive task where one seeks to identify a finite set of categories to describe the data.

D.   Learning a function that maps a data item to a real-valued prediction variable.

38: What is Dependency Modeling?

A.   A multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.

B.   Learning a function that maps a data item into one of several predefined groups or clusters.

C.   The process of finding a model which describes significant dependencies between variables

D.   A task which consists of techniques for estimating, from data, the joint multi-variate probability density function of all of the variables/fields in the database.

39: Which of these is NOT a common description of layers?

A.   Hidden

B.   Input

C.   Output

D.   Functional

40: Sharding refers to:

A.   a measure of the noise in a database's contents

B.   partitioning a database for distribution across different servers

C.   simultaneously accessing multiple object databases over SSH

D.   none of the above

41: What is Change and Deviation Detection?

A.   A task focusing on discovering the most significant changes in the data from previously measured or normative values

B.   Methods for finding a compact description for a subset of data.

C.   The process of finding a model which describes significant dependencies between variables

D.   A task which consists of techniques for estimating, from data, the joint multi-variate probability density function of all of the variables/fields in the database.

42: What is the type of data mining that drives the Amazon.com recommendation system?

A.   Fuzzy Logic

B.   Association Learning

C.   Anomaly Detection

D.   Clustering Algorithms

43: Which of the following algorithms is generally suitable for unsupervised learning tasks?

A.   Restricted Boltzmann machine

B.   info-fuzzy networks

C.   k-nearest neighbor

D.   k-means algorithm

44: Which of the following storage solutions is most appropriate for a semi-structured dataset whose members do not all have the same attributes?

A.   MongoDB

B.   SQLite

C.   MySQL

D.   MariaDB

45: In order to estimate classification performance on an entire population, you need _______

A.   (None of these)

B.   Disjoint training

C.   Test Datasets

D.   disjoint training and test datasets

46: Generalization error is a consequence of

A.   Overfit

B.   Parametric analysis

C.   Underfit

D.   Poorly defined Chernoff Bound

47: Which of these are evolutionary computational methods?

A.   Heuristic algorithms

B.   Bayesian inference algorithms

C.   Genetic algorithms

D.   Clustering algorithms

48: Support Vector Machines have an advantage over Neural Networks because SVMs are

A.   none of the above

B.   easier to train via online learning

C.   more resistant to local minima convergence

D.   parametric

49: Which of the following is NOT a common source system?

A.   Node

B.   SAP source

C.   UDC

D.   DB Connect

50: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset is:

A.   Nearest Neighbor

B.   Logistic Regression

C.   Association Model Query

D.   Decision Treeing
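
The technique described in question 50 can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the function name `knn_classify`, the Euclidean distance, the majority vote, and the tiny historical dataset are all choices made for the example.

```python
from collections import Counter
import math

def knn_classify(history, query, k=3):
    """Label `query` by majority vote among its k most similar historical records.

    `history` is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: two features and a class label each.
history = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"),
]
print(knn_classify(history, (1.1, 1.0)))
```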

51: What is the extraction of useful if-then rules from data based on statistical significance?

A.   Preliminary Method Mapping

B.   Rule Induction

C.   Fuzzy Logic Application

D.   Dynamic Information Inference

52: What is Classification?

A.   Methods for finding a compact description for a subset of data.

B.   Learning a function that maps a data item into one of several predefined groups.

C.   A discovered pattern that is true on new data with some degree of certainty, and generalizes to other data.

D.   A descriptive task where one seeks to identify a finite set of categories to describe the data.

53: Which of the following is NOT a function of data warehouses?

A.   Cleaning dirty data

B.   Extracting data

C.   Cleaning data

D.   Storing purchased data

54: True or False? The MARS algorithm cannot produce rules.

A.   True

B.   False

55: Which of the following is most appropriate for finding the shortest chain of friends linking two people in a social graph who are not friends with each other?

A.   k-means algorithm

B.   Markov chains

C.   Dijkstra's algorithm

D.   Neural Networks
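
For the shortest-chain problem in question 55, the sketch below runs Dijkstra's algorithm over a small, made-up friendship graph in which every friendship costs 1. The function name `shortest_chain` and the people are hypothetical. With unit edge weights, a plain breadth-first search would find the same path.

```python
import heapq

def shortest_chain(friends, start, goal):
    """Dijkstra's algorithm on a friendship graph where every edge costs 1.

    `friends` maps each person to a list of their friends.
    Returns the shortest list of people from start to goal, or None.
    """
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        dist, person, path = heapq.heappop(queue)
        if person == goal:
            return path
        if person in visited:
            continue
        visited.add(person)
        for friend in friends.get(person, []):
            if friend not in visited:
                heapq.heappush(queue, (dist + 1, friend, path + [friend]))
    return None

friends = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Dan"],
    "Cat": ["Ann", "Eve"],
    "Dan": ["Bob", "Eve"],
    "Eve": ["Cat", "Dan", "Fay"],
    "Fay": ["Eve"],
}
print(shortest_chain(friends, "Ann", "Fay"))  # ['Ann', 'Cat', 'Eve', 'Fay']
```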

56: Which of the following is not a common goal of the KDD Process:

A.   Description

B.   Performance

C.   Prediction

57: What is a genetic algorithm?

A.   A search algorithm that locates an optimal binary string by processing an initial random population of binary strings with operations such as artificial mutation, crossover, and selection.

B.   An algorithm that estimates how well a particular pattern (a model and its parameters) meet the criteria of the KDD process. Evaluation of predictive accuracy (validity) is based on cross validation. Evaluation of descriptive quality involves predictive a

C.   A classic algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item s

58: What is Interestingness?

A.   An overall measure of pattern value, combining validity, novelty, usefulness, and simplicity.

B.   An expression E in a language L describing facts in a subset FE of F.

C.   A multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.

D.   A discovered pattern that is true on new data with some degree of certainty, and generalizes to other data.

59: In the MapReduce model, Map and Reduce functions act directly on which kind of data structure?

A.   MySQL matrices

B.   linked lists

C.   relational databases

D.   key-value pair
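
To show the kind of data that Map and Reduce functions pass around in question 59, here is a single-process word-count sketch. It only imitates the MapReduce model in plain Python; the function names `map_words`, `reduce_counts`, and `run_word_count` are assumptions and do not belong to any Hadoop API.

```python
from collections import defaultdict

def map_words(doc_id, text):
    """Map step: emit one (word, 1) pair per word in the document."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return (word, sum(counts))

def run_word_count(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_words(doc_id, text):
            grouped[key].append(value)  # shuffle: group values by key
    return dict(reduce_counts(k, v) for k, v in grouped.items())

docs = {"d1": "data mining finds patterns", "d2": "mining data at scale"}
print(run_word_count(docs))
```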

60: In Natural Language Processing, what is the role of a lexical analyzer?

A.   checks the validity of a token

B.   splits the stream of input characters into tokens

C.   generates a context-free grammar

D.   processes the parse tree for semantic meaning

61: What is Clustering?

A.   A task which consists of techniques for estimating, from data, the joint multi-variate probability density function of all of the variables/fields in the database.

B.   A descriptive task where one seeks to identify a finite set of categories to describe the data.

C.   Learning a function that maps a data item into one of several predefined groups or clusters.

D.   The process of finding a model which describes significant dependencies between variables

62: A DBMS reduces data redundancy and inconsistency by

A.   Utilizing a data dictionary

B.   uncoupling program and data

C.   Minimizing isolated files with repeated data

D.   Enforcing referential integrity

63: In which type of analysis is a Kohonen feature map typically employed?

A.   Descriptive modeling analysis

B.   Cluster analysis

C.   Exploratory data analysis

D.   Predictive analysis

64: Which of the following clustering algorithms can optimize an objective function?

A.   DBSCAN and Single Link

B.   k-means and CLARANS

C.   k-means only

D.   Subspace Clustering Algorithms

A.   Linear regression

B.   Clustering

C.   Knowledge

D.   Meta-data

66: Which of the following properties applies to Single-Layer Perceptrons?

A.   backpropagation

B.   random initialization of weights

C.   continuous output

D.   able to learn non-linear separations

67: Which of the following is NOT a method of combining multiple models into an ensemble model?

A.   Voting

B.   Stacking

C.   Averaging

D.   Bootstrapping

68: What is Summarization?

A.   A task focusing on discovering the most significant changes in the data from previously measured or normative values

B.   A descriptive task where one seeks to identify a finite set of categories to describe the data.

C.   The process of finding a model which describes significant dependencies between variables

D.   Methods for finding a compact description for a subset of data.

69: "In 2% of the purchases at the hardware store, both a pick and a shovel were bought," is an example of:

A.   Validation

B.   Support

C.   Supervised learning

D.   Topology
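
The statistic quoted in question 69 is the share of transactions that contain a given set of items. A minimal sketch of that calculation follows; the function name `support` and the four baskets are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

# Hypothetical hardware-store baskets.
baskets = [
    {"pick", "shovel"},
    {"pick"},
    {"shovel", "gloves"},
    {"pick", "shovel", "gloves"},
]
print(support(baskets, {"pick", "shovel"}))  # 0.5
```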

70: A commonly used continuous alternative to the step function in multi-layered neural network output is the

A.   logistic function

B.   multi-layered NN cannot compute continuous output

C.   hyperbolic function

D.   logarithmic function

71: What is Pig?

A.   A programming language that enables Hadoop to operate as a data warehouse.

B.   None of these

C.   A programming language that simplifies the common tasks of working with Hadoop.

72: Taking multiple random samples of data and building a classification model for each is known as:

A.   Fuzzy Sampling

B.   Binning

C.   Boosting

D.   Clustering

A.   //a/[contains(@href, "profile")]

B.   //a/[contains(@href, "profile")]/@href

C.   //href/profile

D.   //a/profile

74: Which of the following algorithms produces decision trees?

A.   DBSCAN

B.   ID3

C.   none of the above

D.   logistic regression

75: Which of the following properties is a constraint on a RESTful application?

A.   stateless

B.   linearly separable

C.   returns JSON output

D.   stateful

76: The component of the Hadoop Distributed Filesystem responsible for storing metadata is called the

A.   Datanode

B.   FS Shell

C.   DFSAdmin

D.   Namenode

77: If more than one value occurs the same number of times, the data is:

A.   Multi-faceted

B.   Multi-leafed

C.   Multivariated

D.   Multi-modal

78: What is the first step in the business understanding phase?

A.   Firmly grasp business objectives and needs

B.   Assess the current situation by finding out the resources, assumptions, constraints etc.

C.   Create data mining goals to achieve the business objectives

D.   Create a list of all relevant algorithms to be applied to the task

79: What is CURL?

A.   A command-line tool for retrieving files

B.   A methodology for classifying hidden features of data

C.   The part of HTTP that specifies access permission

D.   Combinatorial Unsupervised Recursive Learning algorithm

80: The level of the model that specifies the strengths of the dependencies using some numerical scale.

A.   Numeric Level

B.   Primary Level

C.   Dependency Level

D.   Quantitative Level

81: Apriori is a seminal algorithm for finding frequent item sets using:

A.   Normal mixture models

B.   Candidate generation

C.   Overfitting methods

D.   None of these

82: The authentication protocol used by many significant web APIs is called:

A.   HTTPS

B.   PGP

C.   OAuth

D.   SSL

83: Which of these is not a step in the KDD process?

A.   Data Integration

B.   Data Mining

C.   Data Cleaning

D.   Data Quantification

84: Which of the following applications are usually used to classify students' performances?

A.   Cluster analysis

B.   If...then... analysis

C.   Regression analysis

D.   Market-basket analysis

85: In any numerical data set with a meaningful mean value, what is the minimum fraction of data that will fall within n standard deviations of the mean?

A.   1/n^2

B.   1/n

C.   1-1/n^2

D.   1/2n
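
The bound asked about in question 85 is Chebyshev's inequality, which guarantees that at least 1 - 1/n^2 of the values lie within n standard deviations of the mean. The sketch below checks that bound empirically on a small, deliberately skewed data set; the helper name `fraction_within` and the data are assumptions, and the population standard deviation is used.

```python
import statistics

def fraction_within(values, n):
    """Fraction of values lying within n population standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(1 for v in values if abs(v - mean) <= n * sd) / len(values)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]  # made-up values with one large outlier
for n in (2, 3):
    # Observed fraction on the left, Chebyshev lower bound on the right.
    print(n, fraction_within(data, n), ">=", 1 - 1 / n**2)
```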

86: Which of the following methods can be used for modeling a categorical target variable?

A.   All of the Above

B.   Logistic Regression

C.   ARIMA

D.   Non-Linear Regression

E.   Regression

87: Which of the following is not a primary phase of a Hadoop Reducer?

A.   Sort

B.   Reduce

C.   Map

D.   Shuffle

88: Which of these is a possible architecture of a data mining system?

A.   No-coupling

B.   Magnetic coupling

C.   Transitive coupling

D.   Quickstart coupling

89: True or False? Artificial neural networks are linear predictive models.

A.   True

B.   False

90: The measured differences between a model and its predictions are known as:

A.   Noise

B.   Outliers

C.   Range

D.   Non-applicable data

91: Hash-based technique, Transaction Reduction, Partitioning, Sampling, and Dynamic Itemset Counting are all examples of what?

A.   Techniques to improve the efficiency of an Apriori algorithm

B.   Methods to repeatedly scan the database and check a large set of candidates by pattern matching.

C.   Methods of generating frequent item sets without candidate generation.

D.   Methods for finding a compact description for a subset of data.

92: Which of the following is part of a retail customer data mining strategy?

A.   customer testimonials

B.   holiday sale

C.   money-back guarantee

D.   loyalty cards

93: Which decision tree method performs multi-level splits when computing classification trees?

A.   ID3 (Iterative Dichotomiser 3)

B.   C4.5 algorithm

C.   CART (Classification and Regression Trees)

D.   CHAID (Chi Square Automatic Interaction Detection)

94: What is the advantage of the k-Medoids Clustering Algorithm over the k-Means Clustering (Lloyd's) Algorithm?

A.   uses iterative refinement

B.   more resistant to outliers

C.   all of the above

D.   represents clusters by center

95: The two major functions of BI servers are:

A.   Processing and management

B.   Source and results

C.   Management and delivery

D.   Application and delivery

96: Which of the following is not an appropriate tool for harvesting data from a website that accesses its database through Javascript/AJAX calls?

A.   All of the above are appropriate

B.   Selenium

C.   PhantomJS

D.   wget

97: A descriptive approach to exploring data that can help identify relationships among values in a database is:

A.   Predictive analysis

B.   Function activation

C.   Link analysis

D.   Clustering

98: How do you measure interestingness in association patterns?

A.   measure variance

B.   measure relevance

C.   measure accuracy

D.   measure lift

99: Which of the following is not valid JSON?

A.   {"answer": "this one"}

B.   {"answer": ["this one"]}

C.   {["answer": "this one"]}

D.   All are valid
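
One way to check the candidate strings in question 99 is to hand each one to a JSON parser. The sketch below uses Python's standard `json` module; the helper name `is_valid_json` is an assumption for this example.

```python
import json

def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

candidates = [
    '{"answer": "this one"}',
    '{"answer": ["this one"]}',
    '{["answer": "this one"]}',
]
for text in candidates:
    print(text, "->", is_valid_json(text))
```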

100: Where can a website operator generally find data on her customers' IP addresses?

A.   HTTP request headers

B.   cookies

C.   server logfiles

D.   all of the above