CISC 7700X - Introduction to Data Science

CISC 7700X HW# 1 (due by 2nd class;): Using the Iris dataset, build a kNN model to identify the species of a flower given sepal_length, sepal_width, petal_length,petal_width. Feel free to use whatever language/tool you are comfortable with. I encourage you to write C/C++/Java/C#/SQL/Python code. You may also use Excel, or Weka or Colab or whatever other library/tool you find. Submit (via email), the model code.

CISC 7700X HW# 2: Continuing with the Iris dataset, plot the histograms for each of the attributes: sepal_length, sepal_width, petal_length, petal_width. Find the average and standard deveation for sepal_length, sepal_width, petal_length, petal_width for each label. Find the median and IQR for sepal_length, sepal_width, petal_length, petal_width for each label. Use bootstrap method to find error bounds on all of the above.

CISC 7700X HW# 3: We have a labeled training data set: hw3.data1.csv.gz.

Thinking of a linear model, we come up with:

y = 24*column1 + -15*column2 + -38*column3 + -7*column4 + -41*column5 + 35*column6 + 0*column7 + -2*column8 + 19*column9 + 33*column10 + -3*column11 + 7*column12 + 3*column13 + -47*column14 + 26*column15 + 10*column16 + 40*column17 + -1*column18 + 3*column19 + 0*column20 + -6

if y is > 0 then 1 othewise -1.

What is the accuracy? Calculate the confusion matrix for this model. If cost of a false negative is $1000, and cost of a false positive is $100, (and $0 for an accurate answer), what is the expected economic gain?

How can we tweak the model to increase economic gain? Come up with a model that maximizes economic gain (approximations are OK; try guestimating a few possibilities in a spreadsheet, etc.).

Email the numbers and the steps you used to calculate things (you can do most of this homework in a spreadsheet [Excel?], but I highly encourage you to write code---learn Python if not sure where to start).