To construct and study large social networks from communication data, one must infer unobserved ties (e.g. i is connected to j) from observed communication events (e.g. i emails j). Often overlooked, however, is the impact this tie definition has on the corresponding network, and in turn on the relevance of the inferred network to the research question of interest. We studied the problem of network inference and relevance for two email data sets of different size and origin. In each case, we generated a family of networks parameterized by a threshold condition on the frequency of emails exchanged between pairs of individuals. After demonstrating that different choices of the threshold correspond to dramatically different network structures, we defined the relevance of these networks with respect to a series of prediction tasks that depend on various network features. In general, we found: a) that prediction accuracy is maximized over a non-trivial range of thresholds; b) that for any prediction task, choosing the optimal value of the threshold yields a sizable (~30%) boost in accuracy over naive choices; and c) that the optimal threshold value appears to be (somewhat surprisingly) consistent across data sets and prediction tasks.
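To make the idea of a threshold-parameterized family of networks concrete, here is a minimal Python sketch. The abstract does not specify the exact threshold condition, so the function name `threshold_network`, the parameter `tau`, and the two rules shown (a reciprocal rule requiring at least `tau` emails in each direction, and a pooled rule on the total count) are illustrative assumptions rather than the authors' method. The toy event list is invented for the example.

```python
from collections import Counter

def threshold_network(events, tau, reciprocal=True):
    """Infer undirected ties from (sender, recipient) email events.

    Hypothetical rule, assumed for illustration only:
    - reciprocal=True: keep {i, j} if i sent j at least tau emails
      AND j sent i at least tau emails.
    - reciprocal=False: keep {i, j} if the total number of emails
      exchanged in both directions is at least tau.
    """
    # Count emails per ordered pair, ignoring self-addressed messages.
    counts = Counter((s, r) for s, r in events if s != r)
    ties = set()
    for (s, r), n in counts.items():
        back = counts.get((r, s), 0)
        if reciprocal:
            keep = n >= tau and back >= tau
        else:
            keep = n + back >= tau
        if keep:
            ties.add(frozenset((s, r)))
    return ties

# Toy data: a and b exchange two emails each way, a and c one each way,
# and a emails d once with no reply.
events = [
    ("a", "b"), ("b", "a"), ("a", "b"), ("b", "a"),
    ("a", "c"), ("c", "a"),
    ("a", "d"),
]

# One network per threshold value: the "family" of inferred networks.
family = {tau: threshold_network(events, tau) for tau in (1, 2, 3)}
for tau, ties in family.items():
    print(tau, sorted(tuple(sorted(t)) for t in ties))
```

Even on this toy input the inferred network shrinks from two ties to none as `tau` rises from 1 to 3, which is the kind of sensitivity to the tie definition that the abstract describes at scale.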
Tuesday, January 26, 2010
Free and open to the public