May 11, 2016
Data Mining | Machine Learning | Course Project

1 Project Guidelines
15 April 2016
The formative assessment for the AAML project is a written analysis of approximately 3,000 words, due on May 31, 2016, which counts for 60% of the course grade.
You are required to explore and analyse a set of texts: Hillary Clinton’s 7,945 emails released in 2015.
The original emails (released as PDF documents) are available on the US State Department’s website. A cleaned and normalized version of the released documents is hosted on Kaggle [1].
Apply learning/mining techniques to extract any interesting, surprising, or significant insight from this dataset. Which question you investigate and how you analyse the texts are your choices, but you must justify them.
For example, you might investigate the sentiment expressed about different countries [3], or analyse her views on foreign policy [4]. You can also link these texts with external data sources, e.g. linking them with a news archive or public tweets to learn how Hillary’s sentiment changed in response to public opinion.
2 Data Description

April 22, 2016
Data Mining | Machine Learning | Final Term Project: Option 2 Unsupervised Data Mining (Clustering)

Part 1
Generate a set S of 500 points (vectors) in 3-dimensional Euclidean space. Use the Euclidean distance to measure the distance between any two points. Write a program to find all the outliers in your set S and print out these outliers. If there is no outlier, your program should indicate so. Use any programming language of your choice (specify the programming language you use in the project).
Next, remove the outliers from S, and call the resulting set S’.
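
Part 1 does not say what counts as an outlier, so any program has to pick a rule and state it. The following is a minimal sketch in Python under one assumed rule: a point is an outlier if the Euclidean distance to its k-th nearest neighbor exceeds the mean of those distances over S by more than z standard deviations. The function names, the normal distribution used to generate S, and the defaults k=5 and z=3.0 are illustrative assumptions, not part of the assignment.

```python
import math
import random
import statistics


def generate_points(n=500, seed=0):
    """Draw n points in 3-D from a standard normal distribution (assumed choice)."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(n)]


def kth_neighbor_distance(points, i, k):
    """Euclidean distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def find_outliers(points, k=5, z=3.0):
    """Return indices whose k-NN distance exceeds mean + z * stdev (assumed rule)."""
    scores = [kth_neighbor_distance(points, i, k) for i in range(len(points))]
    threshold = statistics.mean(scores) + z * statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    S = generate_points()
    outliers = find_outliers(S)
    if outliers:
        for i in outliers:
            print("outlier:", S[i])
    else:
        print("No outliers found.")
    # S' is S with the detected outliers removed.
    outlier_set = set(outliers)
    S_prime = [p for i, p in enumerate(S) if i not in outlier_set]
```

Other rules (for example, a fixed distance threshold or a density-based criterion) are equally defensible; whichever is used should be stated in the project write-up.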

Part 2
(1) Write a program that implements the hierarchical agglomerative clustering algorithm taught in class to cluster the points in S’ into k clusters, where k is a user-specified parameter.
(2) Repeat Part 1 and step (1) above on two additional datasets.
Notes on the hierarchical agglomerative clustering algorithm
In determining the distance between two clusters, consider each of the following definitions in turn:
the distance between the nearest two points in the two clusters, 
the distance between the farthest two points in the two clusters, 
the average distance between points in the two clusters, 
the distance between the centers of the two clusters.
Use the definition that yields the best performance where the performance is measured by the Silhouette coefficient.
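
The four cluster-distance definitions and the silhouette-based comparison can be sketched as follows. This is a naive illustration in Python, not necessarily the variant taught in class: it recomputes every cluster distance at each merge, and it assumes the common convention that a point in a singleton cluster has silhouette 0. All function names are hypothetical, and the silhouette coefficient is only defined for k of at least 2.

```python
import math
from itertools import combinations


def single(a, b):
    """Distance between the nearest two points in the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def complete(a, b):
    """Distance between the farthest two points in the two clusters."""
    return max(math.dist(p, q) for p in a for q in b)


def average(a, b):
    """Average distance over all cross-cluster pairs of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def center(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def centroid(a, b):
    """Distance between the centers of the two clusters."""
    return math.dist(center(a), center(b))


LINKAGES = {"single": single, "complete": complete,
            "average": average, "centroid": centroid}


def agglomerate(points, k, linkage):
    """Start with one cluster per point; merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters


def silhouette(clusters):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    total, n = 0.0, 0
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            n += 1
            if len(cluster) == 1:
                continue  # assumed convention: singleton points contribute 0
            # a: mean distance to the other points in the same cluster
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the points of any other cluster
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / n


def best_linkage(points, k):
    """Cluster with each definition and return the best name plus all scores."""
    scores = {name: silhouette(agglomerate(points, k, fn))
              for name, fn in LINKAGES.items()}
    return max(scores, key=scores.get), scores
```

Calling best_linkage(points, k) returns the name of the definition with the highest mean silhouette together with the score for each definition. Because every merge step rescans all cluster pairs, this sketch is slow on 500 points; caching pairwise distances would be a natural improvement.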