Metaheuristic search techniques have been extensively used to automate the generation of test cases, thus providing solutions for a more cost-effective testing process. This approach to test automation, often termed “Search-based Software Testing” (SBST), has been used for a wide variety of test case generation purposes. Since SBST techniques are heuristic by nature, they must be empirically investigated in terms of how costly and effective they are at reaching their test objectives and whether they scale up to realistic development artifacts. However, approaches to empirically study SBST techniques have shown wide variation in the literature. This paper presents the results of a systematic, comprehensive review that aims at characterizing how empirical studies have been designed to investigate SBST cost-effectiveness and what empirical evidence is available in the literature regarding SBST cost-effectiveness and scalability. We also provide a framework that drives the data collection process of this systematic review and can be the starting point of guidelines on how SBST techniques can be empirically assessed. The intent is to aid future researchers doing empirical studies in SBST by providing an unbiased view of the body of empirical evidence and by guiding them in performing well-designed empirical studies.
Voting is the process through which a democratic society determines its government. Therefore, voting systems are as important as other well-known critical systems, such as air traffic control systems or nuclear plant monitors. Unfortunately, voting systems have a history of failures that seems to indicate that their quality is not up to the task. Because of the alarming frequency and impact of these malfunctions, in recent years a number of vulnerability analysis exercises have been carried out against voting systems to determine if their confidentiality, integrity, and availability can be compromised. We have participated in two such large-scale projects, sponsored by the Secretaries of State of California and Ohio, in which the electronic voting machines used in those two states were tested. In our testing, we identified major flaws and implemented a number of attacks, which allowed us to take complete control of the examined voting systems. As a result of these evaluations, the Secretaries of State recommended changes to improve the security of the voting process. In this paper, we describe the methodology that we used in testing the two real-world electronic voting systems we evaluated, the findings of our analysis, our system-wide attacks, and the lessons we learned.
Program input syntactic structure is essential for a wide range of applications such as test case generation, software debugging and network security. However, such important information is often not available (e.g., most malware programs make use of secret protocols to communicate) or not directly usable by machines (e.g., many programs specify their inputs in plain text or other random formats). Furthermore, many programs claim they accept inputs with a published format, but their implementations actually support a subset or a variant. Based on the observations that input structure is manifested by the way input symbols are used during execution and most programs take input with top-down or bottom-up grammars, we devise two dynamic analyses, one for each grammar category. Our evaluation on a set of real-world programs shows that our technique is able to precisely reverse engineer input syntactic structure from execution. We apply our technique to hierarchical delta debugging (HDD) and network protocol reverse engineering. Our technique enables the complete automation of HDD, in which programmers were originally required to provide input grammars, and improves the runtime performance of HDD. Our client study on network protocol reverse engineering also shows that our technique supersedes existing techniques.
The potential of communication networks and middleware to enable the composition of services across organisational boundaries remains incompletely realised. In this paper we argue that this is in part due to outsourcing risks, and describe the possible contribution of Service-Level Agreements (SLAs) to mitigating these risks. For SLAs to be effective, it should be difficult to disregard their original provisions in the event of a dispute between the parties. Properties of understandability, precision and monitorability ensure that the original intent of an SLA can be recovered, and compared to trustworthy accounts of service behaviour to resolve disputes fairly and without ambiguity. We describe the design and evaluation of a domain-specific language for SLAs that tend to exhibit these properties, and discuss the impact of monitorability requirements on service provision practices.
IT system architectures and many other kinds of structured artifacts are often described by formal models or informal diagrams. In practice, there are often a number of versions of a model or diagram, such as a series of revisions, divergent variants, or multiple views of a system. Understanding how versions correspond or differ is crucial, and thus automated assistance for matching models and diagrams is essential. We have designed a framework for finding these correspondences automatically based on Bayesian methods. We represent models and diagrams as graphs whose nodes have attributes such as name, type, connections to other nodes, and containment relations, and we have developed probabilistic models for rating the quality of candidate correspondences based on various features of the nodes in the graphs. Given the probabilistic models, we can find high quality correspondences using search algorithms. Preliminary experiments focusing on architectural models suggest that the technique is promising.
One of the most important properties of a good software engineering process and of the design of the software it produces is robustness to changing requirements. Scenario-based analysis is a popular method for improving the flexibility of software architectures. This paper demonstrates a search-based technique for automating scenario-based analysis in the software architecture deployment view. Specifically, a novel parallel simulated annealing search algorithm is applied to the real-time task allocation problem to find baseline solutions which require a minimal number of changes in order to meet the requirements of potential upgrade scenarios. Another simulated annealing based search is used for finding a solution which is similar to an existing baseline when new requirements arise. Solutions generated using a variety of scenarios are judged by how well they respond to different system requirements changes. The evaluation is performed on a set of problems with a controlled set of different characteristics.
The Verifying Compiler (VC) project is a core component of the Dependable Systems Evolution Grand Challenge. The VC offers the promise of automatically proving that a program or component is correct, where correctness is defined by program assertions. While several VC prototypes exist, all adopt a semantics for assertions that is unsound. This paper presents a consolidation of VC requirements analysis activities that, in particular, brought us to ask targeted VC customers what kind of semantics they wanted. Taking into account both practitioners’ needs and current technological factors, we offer recovery of soundness through an adjusted definition of assertion validity that matches user expectations and can be implemented practically using current prover technology. For decades there have been debates concerning the most appropriate semantics for program assertions. Our contribution here is unique in that we have applied fundamental software engineering techniques by asking primary stakeholders what they want and based on this, proposed a means of efficiently realizing the semantics stakeholders want using standard tools and techniques. We describe how support for the new semantics has been added to ESC/Java2, one of the most fully developed VC prototypes. Case studies demonstrate the effectiveness of the new semantics at uncovering previously indiscernible specification errors.
In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from 120 human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80% effective, and better than a human on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with three measures of software quality: code changes, automated defect reports, and defect log messages. We measure these correlations on over 2.2 million lines of code, as well as longitudinally, over many releases of select projects. Finally, we discuss the implications of this study on programming language design and engineering practice. For example, our data suggests that comments, in and of themselves, are less important than simple blank lines to local judgments of readability.
Hard real-time systems typically avoid garbage collection because of its unpredictable execution. Much effort has been expended trying to build predictable garbage collectors which can provide both temporal and spatial guarantees. Unfortunately, most existing work leads to systems that cannot easily achieve a balance between temporal and spatial performance. Moreover, the scheduling of garbage collectors has not been integrated into modern real-time scheduling frameworks, which makes the benefits provided by the advancement of scheduling techniques very difficult to obtain. This paper argues that the existing design criteria for real-time garbage collectors do not reflect the unique requirements of flexible hard real-time systems. As a part of our design criteria, a new performance indicator is proposed to describe the capability of a real-time garbage collector to achieve a better balance between temporal and spatial performance. A hybrid garbage collection algorithm is designed accordingly, which also uses a dual-priority scheduling algorithm to reclaim spare capacity whilst guaranteeing deadlines. The algorithm has been implemented and evaluated in a real-time Java environment.
Astrophysicist Paul Davies discusses new approaches to finding intelligent life elsewhere in the universe. The Seti scientist, whose new book is called Eerie Silence, is on a lecture tour of the UK.
You can hear an extended version of this interview in our latest Science Weekly Extra podcast.
Anthropologist Rick Potts is opening a new exhibition at the Smithsonian’s National Museum of Natural History in Washington DC. It’s called What does it mean to be human?
In the newsjam we discuss the new body set up to investigate an IPCC climate change report, sequencing the genomes of an entire family, and the new energy record about to be smashed at the LHC.
When it comes to theatre, sound is just as important as vision. It’s the subject of a lecture this week in London organised by the Wellcome Trust. Neuroscientist Prof Sophie Scott of University College London and theatre director Jonathan Holmes go on stage at London’s Bloomsbury Theatre to demonstrate. You’ll hear some drama from actor Seth Sinclair.
The Observer’s science editor Robin McKie and Guardian science correspondent Ian Sample join the pod.
Britain must celebrate its scientists, because if the voters do, then so will the politicians
National Science and Engineering Week – running now with 2,000-plus exhibitions, lectures, open days and debates for an expected audience of 1.5 million – began as a whistle in the dark. Back in 1994, the science minister, William Waldegrave, secured a derisory £100,000 for the first one, and it seemed like a gimmick.
The charge of cynicism was unfair: Waldegrave was that rare thing, a minister with a prior and genuine interest in science. But the gesture came near the end of a long period of devastation of an intellectual tradition that had delivered Newton, Faraday, Darwin, James Clerk Maxwell, Rutherford and one of the unsung giants of the 20th century, Paul Dirac. In 15 years of Conservative government, ambitious projects had been abandoned, long-established research teams broken up, laboratories closed, universities starved and institutions privatised. The asset-stripping continued for another three years and, by 1997, British science had a stagnant and impoverished culture, creaking equipment and demoralised personnel.
Paradoxically, it also had a lively national festival of science, engineering and technology, and a separate, slightly later funfair in Edinburgh, both of which attracted crowds of buzzing schoolchildren and delighted adults. The science community took Waldegrave’s crust not as a sop but a challenge, and began to campaign for the re-election of reason and curiosity to the national debate. Thatcherite logic had argued that, if the economy really needed research, the market would provide it. No such thing happened. France, Germany, Japan and the US went on increasing investment in R&D while Britain became the place for merchant bankers and estate agents. But a freshly politicised community had by then understood that, in a democracy, science had to speak up, and so – at their successive jamborees – scientists did just that. They spelled out how information technology was forging a society in which knowledge was the real capital, and economic growth the interest that it accrued.
Here we go again. Last week the Royal Society reminded us that, while British science again faces cuts, France, Germany and the US are spending more than ever. Meanwhile, the inventor James Dyson urged the Tories not to cut the tax credits that support R&D. Peter Mandelson showed some sign of listening in an interview at the weekend, but that anyone should even need to make the argument shows how quickly forgotten have been the lessons of the past 30 years. Instead of paying university bosses the super-salaries we report on today, Britain must celebrate its scientists, because if the voters do, then so, eventually, will the politicians. We need our science festivals more than ever.
Search based optimization techniques have been applied to structural software test data generation since 1992, with a recent upsurge in interest and activity within this area. However, despite the large number of recent studies of applicability of different search based optimization approaches, there has been very little theoretical analysis of the types of testing problem for which these techniques are well-suited. There are also few empirical studies that present results for larger programs. This paper presents a theoretical exploration of the most widely studied approach, the global search technique embodied by Genetic Algorithms. It also presents results from a large empirical study that compare the behaviour of both global and local search based optimization on real world programs. The results of this study reveal that there exist cases of test data generation problems that suit each algorithm, thereby suggesting that a hybrid global-local search (a Memetic Algorithm) may be appropriate. The paper presents a Memetic Algorithm along with further empirical results studying its performance.
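The local search side of this comparison can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the target branch `2 * x == y + 10`, the branch-distance fitness, and the names `branch_distance` and `hill_climb` do not come from the paper. It shows only a simple hill climber that moves to the best neighbouring input until the branch is covered or no neighbour improves.

```python
def branch_distance(x, y):
    """Fitness for a hypothetical target branch `2 * x == y + 10`.

    The value is 0 when the branch condition holds and grows with the
    distance from satisfying it (an assumed, commonly used style of fitness).
    """
    return abs(2 * x - (y + 10))


def hill_climb(start, fitness, max_steps=1000):
    """Move to the best integer neighbour until fitness is 0 or no move helps."""
    current = start
    for _ in range(max_steps):
        if fitness(*current) == 0:
            break  # test input found: the target branch is taken
        x, y = current
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        best = min(neighbours, key=lambda p: fitness(*p))
        if fitness(*best) >= fitness(*current):
            break  # local optimum: no neighbour improves the fitness
        current = best
    return current


test_input = hill_climb((0, 0), branch_distance)
print(test_input, branch_distance(*test_input))  # (5, 0) 0
```

On this smooth fitness landscape the climber cannot get stuck; a global search or a hybrid of the two would be the natural alternative on landscapes with local optima.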
Describing and managing activities, resources and constraints of software development processes is a challenging goal for many organizations. A first generation of Software Process Modeling Languages (SPMLs) appeared in the nineties but failed to gain broad industrial support. Recently, however, a second generation of SPMLs appeared, leveraging the strong industrial interest in modeling languages such as the UML. In this article, we propose a comparison of these UML-based SPMLs. While not exhaustive, this comparison concentrates on SPMLs most representative of the various alternative approaches, ranging from UML-based framework specializations to full-blown executable meta-modeling approaches. To support the comparison of these various approaches, we propose a frame gathering a set of requirements for process modeling, such as semantic richness, modularity, executability, conformity to the UML standard, and formality. Beyond discussing the relative merits of these approaches, we also evaluate the overall suitability of these UML-based SPMLs for software process modeling. Finally, we discuss the impact of these approaches on the current state of the practice, and conclude with lessons we have learned in doing this comparison.
This paper presents an innovative model of a program’s internal behavior over a set of test inputs, called the probabilistic program dependence graph (PPDG), that facilitates probabilistic analysis and reasoning about uncertain program behavior, particularly that associated with faults. The PPDG construction augments the structural dependences represented by a program dependence graph with estimates of statistical dependences between node states, which are computed from the test set. The PPDG is based on the established framework of probabilistic graphical models, which are used widely in a variety of applications. This paper presents algorithms for constructing PPDGs and applying them to fault diagnosis. This paper also presents preliminary evidence indicating that a PPDG-based fault localization technique compares favorably with existing techniques. The paper also presents evidence indicating that PPDGs can be useful for fault comprehension.
Recently, there has been a proliferation of service-based systems, i.e. software systems that are composed of autonomous services, but can also use software code. In order to support the development of these systems, it is necessary to have new methods, processes, and tools. In this paper we describe a UML-based framework to assist with the development of service-based systems. The framework adopts an iterative process in which software services that can provide functional and non-functional characteristics of a system being developed are discovered, and the identified services are used to re-formulate the design models of the system. The framework uses a query language to represent structural, behavioural, and quality characteristics of services to be identified, and a query processor to match the queries against service registries. The matching process is based on distance measurements between the queries and service specifications. A prototype tool has been implemented. The work has been evaluated in terms of recall, precision, and performance measurements.
Ordinal regression has wide applications in many domains where human evaluation plays a major role. Most current ordinal regression methods are based on Support Vector Machines (SVM) and suffer from two problems: they ignore the global information of the data, and they have high computational complexity. On the other hand, although Linear Discriminant Analysis (LDA) and its kernel version, Kernel Discriminant Analysis (KDA), take into consideration the global information of the data as well as the distribution of the classes, and their performance has been proven in classification, they cannot be used directly for solving ordinal regression problems because the ordinal information of the data cannot be utilized. To solve this problem, in this paper, we propose a novel regression approach by extending Kernel Discriminant Learning with a rank constraint. The proposed algorithm is very efficient since the computational complexity is linear in the data size. We demonstrate experimentally that the proposed method is capable of preserving the rank of data classes in a projected data space. In comparison to several ordinal regression methods, our method is more efficient and is competitive with them in accuracy.
Efficiently and effectively searching for similar videos is an important and non-trivial problem in content-based video search systems. In this paper, we propose a subspace symbolization approach, namely SUDS, for content-based search on very large video databases. The novelty of SUDS is that it explores the data distribution in subspaces to build a visual dictionary with which the video data are processed by deriving the string matching techniques with two-step data simplification. Specifically, we first propose an adaptive approach, called VLP, to divide the whole visual feature space into a series of subspaces of variable lengths, from which the dominant ones are selected. By clustering the video keyframes over each dominant subspace, a stable visual dictionary is built and a compact video representation model is developed by transforming each keyframe into a word that is a series of symbols in the dominant subspaces, and further each video into a sequence of words. Then, we present an innovative similarity measure called CVE, which adopts a complementary information compensation scheme based on the visual features and sequence context of videos. Finally, an efficient two-layered index strategy with a number of query optimizations is proposed to facilitate video search. The experimental results demonstrate the high effectiveness and efficiency of SUDS.
In this paper, we present a new type of spatial query called Nearest Surrounder (NS) query. An NS query searches the nearest polygon-shaped spatial objects (referred to as nearest surrounder (NS) objects) for consecutive ranges of angles around a specified query point. With additional angular information provided with NS objects, an NS query is more informative than many other spatial queries. We derive two NS query variants, namely, multi-tier NS (m-NS) query and angle-constrained NS (ANS) query. An m-NS query searches multiple layers of NS objects for the same range of angles from a query point. An ANS query searches NS objects within a specified range of angles. To evaluate NS queries and their variants, we explore angle-based and distance-based bound properties of polygons. Based on these properties, we devise two efficient algorithms, namely, Sweep and Ripple. They access objects in an order according to their orientations and distances to the query point, respectively, based on an R-tree. They can also finish a search with at most one index lookup and progressively deliver a query result. Through empirical studies, we evaluate the proposed algorithms and report their performance for both synthetic and real object sets.
This paper presents a knowledge discovery framework for the construction of Community Web Directories, a concept that we introduced in our recent work, applying personalization to Web directories. In this context, the Web directory is viewed as a thematic hierarchy and personalization is realized by constructing user community models on the basis of usage data. In contrast to most of the work on Web usage mining, the usage data that are analyzed here correspond to user navigation throughout the Web, rather than a particular Web site, exhibiting as a result a high degree of thematic diversity. For modeling the user communities, we introduce a novel methodology that combines the users’ browsing behavior with thematic information from the Web directories. Following this methodology we enhance the clustering and probabilistic approaches presented in previous work and we also present a new algorithm that combines these two approaches. The resulting community models take the form of Community Web Directories. The proposed personalization methodology is evaluated both on a specialized artificial and a general-purpose Web directory, indicating its potential value to the Web user. The experiments also assess the effectiveness of the different machine learning techniques on the task.
Most of the common techniques in text mining are based on the statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of the term within a document only. However, two terms can have the same frequency in their documents, but one term contributes more to the meaning of its sentences than the other term. Thus, the underlying text mining model should indicate terms that capture the semantics of text. In this case, the mining model can capture terms that present the concepts of the sentence, which leads to discovering the topic of the document. A new concept-based mining model that analyzes terms on the sentence, document, and corpus levels is introduced. The concept-based mining model can effectively discriminate between non-important terms with respect to sentence semantics and terms which hold the concepts that represent the sentence meaning. The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. The term which contributes to the sentence semantics is analyzed on the sentence, document, and corpus levels rather than the traditional analysis of the document only. The proposed model can efficiently find significant matching concepts between documents according to the semantics of their sentences. The similarity between documents is calculated based on a new concept-based similarity measure. The proposed similarity measure takes full advantage of using the concept analysis measures on the sentence, document, and corpus levels in calculating the similarity between documents. Large sets of experiments using the proposed concept-based mining model on different datasets in text clustering are conducted. The experiments provide an extensive comparison between the concept-based analysis and the traditional analysis. Experimental results demonstrate a substantial enhancement of the clustering quality using the sentence-based, document-based, corpus-based, and combined-approach concept analysis.
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantisation errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the “complete information” of a data item that takes into account the probability density function (pdf) of that item’s value is utilised. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have been conducted that show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU-demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.
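One way to picture the use of the full pdf when evaluating a split is sketched below. It rests on assumptions not stated in the abstract: each uncertain value is a discrete pdf given as `(value, probability)` pairs, a tuple is divided fractionally between the two sides of a split point, and split quality is the weighted entropy of those fractional class counts. The names `left_mass`, `entropy`, and `split_entropy` are hypothetical.

```python
import math


def left_mass(pdf, z):
    """Probability that the uncertain value falls at or below split point z."""
    return sum(p for v, p in pdf if v <= z)


def entropy(weights):
    """Entropy (in bits) of a dict mapping class label -> fractional count."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total)
                for w in weights.values() if w > 0)


def split_entropy(tuples, z):
    """Weighted entropy after splitting at z, using fractional tuple counts.

    `tuples` is a list of (pdf, label) pairs; each tuple contributes its
    probability mass <= z to the left side and the remainder to the right.
    """
    left, right = {}, {}
    for pdf, label in tuples:
        m = left_mass(pdf, z)
        left[label] = left.get(label, 0.0) + m
        right[label] = right.get(label, 0.0) + (1.0 - m)
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * entropy(left) + (n_right / n) * entropy(right)


# Tuple "A" is uncertain (value 1 or 3 with equal probability); "B" is certain.
data = [([(1, 0.5), (3, 0.5)], "A"), ([(3, 1.0)], "B")]
print(split_entropy(data, 2))
```

Replacing the pdf of tuple "A" by its mean (2) would place the whole tuple on the left and report a perfectly clean split, which is the kind of information loss the averaging approach incurs.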
Researchers have rigorously studied the resampling, algorithms, and feature selection approaches to the class imbalance problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using area under the receiver operating characteristic and area under the precision-recall curve. We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that signal-to-noise ratio and Feature Assessment by Sliding Thresholds are great candidates for feature selection in most applications, especially when selecting very small numbers of features.
Given a set of objects and their pairwise distances, we wish to determine a visual representation of the data. We use the quartet paradigm to compute a hierarchy of clusters of the objects. The method is based on an NP-hard graph optimization problem called the Minimum Quartet Tree Cost problem. This paper presents and compares several heuristic approaches to approximate the optimal hierarchy. The performance of the algorithms is tested through extensive computational experiments and it is shown that the Reduced Variable Neighbourhood Search heuristic is the most effective approach to the problem, obtaining high quality solutions in short computational running times.
Taxonomies, representing hierarchical data, are a key knowledge source in multiple disciplines. Information processing across taxonomies is not possible unless they are appropriately merged for commonalities and differences. For taxonomy merging the first task is to identify common concepts between the taxonomies. Then these common concepts along with their associated concepts in the two taxonomies need to be integrated. Doing this in a conflict-free manner is a challenging task and generally requires human intervention. In this paper we explore the possibility of asymmetrically merging one taxonomy into another, automatically. Given one or more source taxonomies and a destination taxonomy, modeled as directed acyclic graphs, we present intuitive algorithms that merge relevant portions of the source taxonomies into the destination taxonomy. We prove that our algorithms are conflict-free, information-lossless and scalable. We also define precision and recall measures for evaluating enriched taxonomies, such as TA, the result of merging two taxonomies, with TI, the ideal merger. Our experiments indicate the effectiveness of our approach.
t-Closeness is a privacy model recently defined for data anonymization. A data set is said to satisfy t-closeness if, for each group of records sharing a combination of key attributes, the distance between the distribution of a confidential attribute in the group and the distribution of the attribute in the entire data set is no more than a threshold t. Here, we define a privacy measure in terms of information theory, similar to t-closeness. Then, we use the tools of that theory to show that our privacy measure can be achieved by the postrandomization method (PRAM) for masking in the discrete case, and by a form of noise addition in the general case.
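The definition of t-closeness given above can be checked mechanically, as in the following minimal sketch. The abstract does not name a distance between distributions, so total variation distance is used here purely as an assumption; the function names and the toy records are also hypothetical. The sketch covers only the t-closeness check, not the information-theoretic measure or the PRAM and noise-addition constructions.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (assumed metric)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def satisfies_t_closeness(records, key_attrs, confidential, t):
    """True if every group sharing key-attribute values is within distance t
    of the distribution of the confidential attribute over the whole data set."""
    overall = distribution([r[confidential] for r in records])
    groups = {}
    for r in records:
        key = tuple(r[a] for a in key_attrs)
        groups.setdefault(key, []).append(r[confidential])
    return all(total_variation(distribution(g), overall) <= t
               for g in groups.values())


records = [
    {"zip": "130**", "age": "2*", "disease": "flu"},
    {"zip": "130**", "age": "2*", "disease": "cancer"},
    {"zip": "148**", "age": "3*", "disease": "flu"},
    {"zip": "148**", "age": "3*", "disease": "flu"},
]
print(satisfies_t_closeness(records, ["zip", "age"], "disease", t=0.3))  # True
print(satisfies_t_closeness(records, ["zip", "age"], "disease", t=0.2))  # False
```

In this toy data set both groups lie at distance 0.25 from the overall distribution, so the check passes for t = 0.3 and fails for t = 0.2.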
A major assumption in many machine learning and data mining systems is that the data must be from the same feature representations and that the data distributions in the training and test data are the same. However, in many real-world applications, this assumption does not hold. For example, we sometimes have a classification task in one task domain, but we only have sufficient training data in another task domain where the data may be in a different feature space or follow a different distribution. In these cases, knowledge transfer, if done successfully, would greatly benefit learning in our interested domain by avoiding expensive data labeling tasks. In recent years, \emph{transfer learning} has emerged as a new technique to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression and clustering problems. We discuss the relationship between transfer learning and other related research areas, such as domain adaptation, multi-task learning and sample selection bias as well as co-variate shift, and explore some potential future problems in knowledge transfer research.
Visual methods have been widely studied and used in data cluster analysis, \textit{e.g.}, the VAT algorithm for visual analysis of cluster tendency. Given a pairwise dissimilarity matrix $\bm{D}$ of a set of $n$ objects, methods such as VAT generally represent $\bm{D}$ as an $n\times n$ image $\mathrm{I}(\tilde{\bm{D}})$ where the objects are reordered to highlight cluster structure as dark blocks along the diagonal of the image. A major limitation of such visual methods is their inability to highlight cluster structure in $\mathrm{I}(\tilde{\bm{D}})$ when $\bm{D}$ contains clusters with highly complex structure. In this paper, we address this limitation by proposing a Spectral VAT algorithm, where $\bm{D}$ is mapped to $\bm{D'}$ in an embedding space by spectral decomposition of the Laplacian matrix, and then reordered to $\bm{\tilde{D'}}$ using the VAT algorithm. We propose a strategy to automatically determine the number of clusters in $\mathrm{I}(\bm{\tilde{D'}})$, as well as a visual method for cluster formation from $\mathrm{I}(\bm{\tilde{D'}})$ based on the difference between diagonal blocks and off-diagonal blocks. In addition, we propose a sampling-based extended scheme to enable visual cluster tendency assessment and data partitioning for large data sets. Extensive experimental results on several synthetic and real-world data sets demonstrate the effectiveness of our algorithms.
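The reordering step that VAT-style methods apply to a dissimilarity matrix can be sketched as follows. This is a minimal pure-Python illustration of the standard VAT ordering (a Prim-like nearest-neighbour sweep), not the Spectral VAT algorithm itself: the spectral embedding step is omitted, and the function names `vat_order` and `reorder` are hypothetical.

```python
def vat_order(D):
    """Return an ordering of object indices for a symmetric dissimilarity matrix D.

    D is a list of lists. The sweep starts at an object involved in the
    largest dissimilarity and repeatedly appends the unselected object that
    is closest to any already selected object.
    """
    n = len(D)
    # Start at one endpoint of the largest pairwise dissimilarity.
    start = max(range(n), key=lambda i: max(D[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[j] = smallest dissimilarity from j to any selected object.
    nearest = {j: D[start][j] for j in remaining}
    while remaining:
        # Ties are broken by index so the result is deterministic.
        j = min(remaining, key=lambda k: (nearest[k], k))
        order.append(j)
        remaining.remove(j)
        for k in remaining:
            nearest[k] = min(nearest[k], D[j][k])
    return order


def reorder(D, order):
    """Permute rows and columns of D according to the given ordering."""
    return [[D[i][j] for j in order] for i in order]
```

Rendering the reordered matrix as a grayscale image, with small dissimilarities dark, is what makes compact clusters appear as dark blocks along the diagonal.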
In this work, web-based metrics that compute the semantic similarity between words or terms are presented and compared with the state-of-the-art. Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant web documents are downloaded via a web search engine and the contextual information of words of interest is compared (context-based similarity metrics). The proposed algorithms work automatically, do not require any human-annotated knowledge resources, e.g., ontologies, and can be generalized and applied to different languages. Context-based metrics are evaluated both on the Charles-Miller dataset and on a medical term dataset. It is shown that context-based similarity metrics significantly outperform co-occurrence-based metrics, in terms of correlation with human judgment, for both tasks. In addition, the proposed unsupervised context-based similarity computation algorithms are shown to be competitive with state-of-the-art supervised semantic similarity algorithms that employ language-specific knowledge resources. Specifically, context-based metrics achieve correlation scores of up to 0.88 and 0.74 for the Charles-Miller and medical datasets, respectively. The effect of stop-word filtering is also investigated for word and term similarity computation. Finally, the performance of context-based term similarity metrics is evaluated as a function of the number of web documents used and for various feature weighting schemes.
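The idea that similar contexts imply similar meaning can be illustrated with a small sketch. It is a generic bag-of-words construction under stated assumptions, not the metrics evaluated in the abstract: documents are already available as strings rather than fetched from a search engine, tokens are split on whitespace, raw counts serve as feature weights, and the window size and function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(word, documents, window=2):
    """Count the tokens that occur within `window` positions of `word`."""
    ctx = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                ctx.update(left + right)
    return ctx


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_similarity(w1, w2, documents, window=2):
    """Similarity of two words as the cosine of their context vectors."""
    return cosine(
        context_vector(w1, documents, window),
        context_vector(w2, documents, window),
    )
```

Two words that appear in identical contexts score close to 1, while words whose contexts share only common function words score lower, which is one reason stop-word filtering and feature weighting matter.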
Long time-series datasets are common in many domains, especially scientific domains. Applications in these fields often require comparing trajectories using similarity measures. Existing methods perform well for short time-series, but their evaluation cost grows rapidly for longer time-series. In this work, we develop a new time-series similarity measure called the Dictionary Compression Score (DCS). We show that this method allows us to accurately and quickly calculate similarity for both short and long time-series. We use the well-known Kolmogorov complexity from information theory and the Lempel-Ziv compression framework as a basis to calculate similarity scores. We show that off-the-shelf compressors do not fare well for computing time-series similarity. To address this problem, we developed a novel dictionary-based compression technique to compute time-series similarity. We also develop heuristics to automatically identify suitable parameters for our method, thus removing the task of parameter tuning found in other existing methods. We have extensively compared DCS with existing similarity methods for classification. Our experimental evaluation shows that for long time-series datasets, DCS is accurate, and it is also significantly faster than existing methods.
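The general idea of compression-based similarity can be sketched with an off-the-shelf compressor. The example below is not DCS, whose dictionary-based technique is not specified in the abstract. It assumes the normalized compression distance (NCD) as the score, Python's `zlib` as the compressor, and a simple equal-width discretization of the series into symbols. The function names and the `n_bins` parameter are hypothetical.

```python
import zlib


def discretize(series, n_bins=8):
    """Map a numeric series to a byte string of equal-width bin indices."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for constant series
    return bytes(
        min(n_bins - 1, int((v - lo) / span * n_bins)) for v in series
    )


def compressed_size(data):
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings.

    Smaller values mean the two inputs share more compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A series compared with itself yields a small distance because the second copy compresses almost for free, whereas an unrelated noisy series yields a distance near 1. This baseline corresponds to the kind of off-the-shelf compressor the abstract reports as performing poorly for time-series similarity.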