In this paper, we address the temporal moment localization problem, namely, localizing the video moment described by a natural language query in an untrimmed video. This is a general yet challenging vision-language task, since it requires not only localizing moments but also comprehending the multimodal textual-temporal information (e.g., "first" and "leaving") that distinguishes the desired moment from the others, especially those with similar visual content. Whereas existing studies treat a given language query as a single unit, we propose to decompose it into two components: the cue relevant to localizing the desired moment and the cue irrelevant to the localization. This allows us to flexibly adapt to arbitrary queries in an end-to-end framework. In our proposed model, a language-temporal attention network learns word-level attention based on the temporal context information in the video. Our model can therefore automatically select "what words to listen to" when localizing the desired moment. We evaluate the proposed model on two public benchmark datasets, DiDeMo and Charades-STA, and the experimental results verify its superiority over several state-of-the-art methods.
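As a rough illustration of the idea (not the authors' released implementation), the sketch below shows one way word-level attention can be conditioned on a candidate moment's temporal context so that the weighted query emphasizes localization-relevant words. The module name, feature dimensions, and the additive-attention form are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageTemporalAttention(nn.Module):
    """Sketch of language-temporal attention: word weights conditioned on video context."""

    def __init__(self, word_dim=300, video_dim=500, hidden_dim=256):
        super().__init__()
        # Illustrative dimensions; the paper's actual feature sizes may differ.
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, word_feats, clip_feat):
        # word_feats: (B, N, word_dim) word embeddings of the query
        # clip_feat:  (B, video_dim) temporal context feature of a candidate moment
        w = self.word_proj(word_feats)                      # (B, N, H)
        v = self.video_proj(clip_feat).unsqueeze(1)         # (B, 1, H), broadcast over words
        scores = self.score(torch.tanh(w + v)).squeeze(-1)  # (B, N) additive attention scores
        alpha = F.softmax(scores, dim=-1)                   # word attention given video context
        # Weighted sum of word features: the "relevant" part of the query for this moment.
        attended_query = torch.bmm(alpha.unsqueeze(1), word_feats).squeeze(1)  # (B, word_dim)
        return attended_query, alpha
```

The returned attention weights make the selection of "what words to listen to" explicit, while the attended query vector can then be matched against candidate moments for localization.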