There are many ways to gain massive speed improvements in Python. In this post I list resources for learning how to obtain these improvements (and how to use Python for compute-intensive tasks). If you have come across other interesting talks, modules, websites, or tips, please let me know.

Python internals, general talks:

Profiling:
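
As a minimal sketch of what profiling looks like in practice, the snippet below uses the standard-library `cProfile` and `pstats` modules. The function `slow_sum_of_squares` and the helper `profile_call` are hypothetical names made up for this illustration; they do not come from any of the resources in this post.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop, used only as something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report shows where the time is actually spent, which is usually worth checking before reaching for any of the heavier tools listed further down.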

Numpy tips:

Alternative Python Implementations:

Another option is to run or compile your code with something other than the standard CPython interpreter:

  • Cython: write C extensions for your Python code and add static typing
  • PyPy: just-in-time (JIT) compilation
  • Jython: compiles Python to Java bytecode that runs on the JVM
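
One simple way to see whether a different interpreter helps is to time the same pure-Python workload under each of them. The sketch below is only an illustration: `count_primes` is a hypothetical CPU-bound function chosen for this example, and the script is assumed to be run once per interpreter (for example with `python` and then with `pypy3`, if installed).

```python
import platform
import timeit


def count_primes(limit):
    """Count primes below limit with naive trial division (CPU-bound on purpose)."""
    count = 0
    for n in range(2, limit):
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                break
        else:
            # No divisor found, so n is prime.
            count += 1
    return count


# Take the best of three runs to reduce noise from other processes.
best = min(timeit.repeat(lambda: count_primes(20_000), number=1, repeat=3))
implementation = platform.python_implementation()
print(f"{implementation}: {best:.4f} s")
```

The printed label comes from `platform.python_implementation()`, so the output of each run identifies which interpreter produced the timing.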

Distributed computing:

GPU programming:

Scientific Python Distributions:

There are optimized Python distributions available, containing multiple useful scientific modules.

The Student Startup Trip was a very short introduction to the world of entrepreneurs in London. Because it overlapped with ECIR2014, I was only able to attend the second half of the trip.

However, it was a very inspiring visit! The biggest lessons I learned were the big impact of working in a start-up and the ever-present ‘failure-is-learning’ mentality of the entrepreneurs. Furthermore, the diverse backgrounds of the entrepreneurs showed that there is no single correct path to starting your own company. I’m looking forward to a similar trip next year!

The trip also resulted in some articles in DataNews (Dutch):

ECIR2014 was hosted in Amsterdam and was the first conference I ever visited. It was a very enriching experience. Some interesting papers related to my research (with their abstracts) are listed below:

  • Boilerplate Detection and Recoding: Many information access applications have to tackle natural language texts that contain a large proportion of repeated and mostly invariable patterns – called boilerplates – such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning and an ideal document representation should reflect this phenomenon. We propose here a method that detects automatically and in an unsupervised way such motifs; and enriches the document representation by including specific features for these motifs. We experimentally show that this document recoding strategy leads to improved classification on different collections.
  • GTE-Cluster: A Temporal Search Interface for Implicit Temporal Queries (demo): In this paper, we present GTE-Cluster, an online temporal search interface which consistently allows searching for topics in a temporal perspective by clustering relevant temporal Web search results. GTE-Cluster is designed to improve user experience by augmenting document relevance with temporal relevance. The rationale is that offering the user a comprehensive temporal perspective of a topic is intuitively more informative than retrieving a result that only contains topical information. Our system does not pose any constraint in terms of language or domain, thus users can issue queries in any language ranging from business, cultural, political to musical perspective, to cite just a few. The ability to exploit this information in a temporal manner can be, from a user perspective, potentially useful for several tasks, including user query understanding or temporal clustering.
  • Metric Spaces for Temporal Information Retrieval: Documents and queries are rich in temporal features, both at the meta-level and at the content-level. We exploit this information to define temporal scope similarities between documents and queries in metric spaces. Our experiments show that the proposed metrics can be very effective for modeling the relevance for different search tasks, and provide insights into an inherent asymmetry in temporal query semantics. Moreover, we propose a simple ranking model that combines the temporal scope similarity with traditional keyword similarities. We experimentally show that it is not worse than traditional keyword-based rankings for non-temporal queries, and that it improves the overall effectiveness for time-based queries.

  • Time-Aware Focused Web Crawling: There is a plethora of information inside the Web. Even the top commercial search engines cannot download and index all the available information. So, in recent years, there have been several research works on the design and implementation of focused topic crawlers and also on geographic scope crawlers. Unlike other areas of information retrieval, research on Web crawling is not using the temporal information extracted from Web pages in the used crawling criteria. Therefore, our research challenge is the use of temporal data extracted from Web pages as the main crawling criteria to satisfy a given temporal focus. The importance of the time dimension is quite amplified when combined with topic or geography, but now we want to study it isolated. The used approach is based on temporal segmentation of Web pages text. It only follows links within segments tagged with dates in the scope of restriction. A precision around 75% was achieved in preliminary experimental results.

  • Automatically Retrieving Explanatory Analogies from Webpages: Explanatory analogies make learning complex concepts easier by elaborately mapping a target concept onto a more familiar source concept. Solutions exist for automatically retrieving shorter metaphors from natural language text, but not for explanatory analogies. In this paper, we propose an approach to find webpages containing explanatory analogies for a given target concept. For this, we propose the use of a ‘region of interest’ (ROI) based on the observation that linguistic markers and source concept often co-occur with various forms of the word ‘analogy’. We also suggest an approach to identify the source concept(s) contained in a retrieved analogy webpage. We demonstrate these approaches on a dataset created using Google custom search to find candidate web pages that may contain analogies.

  • Temporal Expertise Profiling: We introduce the temporal expertise profiling task: identifying the skills and knowledge of an individual and tracking how they change over time. To be able to capture and distinguish meaningful changes, we propose the concept of a hierarchical expertise profile, where topical areas are organized in a taxonomy. Snapshots of hierarchical profiles are then taken at regular time intervals. Further, we develop methods for detecting and characterizing changes in a person’s profile, such as switching the main field of research or narrowing/broadening the topics of research. Initial results demonstrate the potential of our approach.

  • Towards an Entity–Based Automatic Event Validation: Event Detection algorithms infer the occurrence of real–world events from natural language text and always require a ground truth for their validation. However, the lack of an annotated and comprehensive ground truth makes the evaluation onerous for humans, who have to manually search for events inside it. In this paper, we envision automating the evaluation process by defining the novel problem of Entity–based Automatic Event Validation. We propose a first approach which validates events by estimating the temporal relationships among their representative entities within documents in the Web. Our approach reached a Kappa Statistic of 0.68 when compared with the evaluation of real–world events done by humans. This and other preliminary results motivate further research effort on this novel problem.