Video-based datasets for Continuous Sign Language are scarce due to the challenging task of recording videos from native signers or deaf people, and the reduced number of people who know a specific country's sign language and can annotate those videos. On the other hand, the COVID-19 pandemic has highlighted the importance of sign language translation for communicating with signers; for instance, sign language interpreters have played a key role in delivering nationwide health messages in sign languages. In this paper, we present a framework for creating a multi-modal sign language interpretation dataset from videos. We use this framework to create the first dataset for Peruvian Sign Language interpretation, annotated by hearing volunteers with intermediate knowledge of Peruvian Sign Language who were guided by the video audio. Technology developed for the deaf community should involve signers in its design and construction; for that reason, we rely on hearing people only to produce a first version of the annotations, which should be reviewed by native signers in the future. The specific contributions of our work are: i) we design PeruSIL, a framework for the creation of a sign language interpretation dataset, consisting of an annotation convention based on the glossing system and a pipeline to combine text and keypoint landmark annotations; ii) we publicly release the first Peruvian Sign Language interpretation multi-modal dataset (AEC), annotated using our framework; iii) we evaluate the annotations produced by hearing people by training a sign language recognition model and testing it on the same dataset and on a second dataset in which the subjects and the annotator are deaf. Our model reaches up to 80.3% accuracy when discriminating among a minimum of five classes (signs) on the same dataset, and up to 51.8% on the second dataset.
Nevertheless, when analyzing accuracy by subject in the second dataset, the subjects who account for 63% of the instances each reach an accuracy above 50%.
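The pipeline mentioned in contribution i) aligns gloss-level text annotations with per-frame keypoint landmarks. The following is a minimal sketch of that alignment step under assumed conventions: the data layout (gloss segments as millisecond intervals, keypoints as per-frame landmark lists), the frame rate, and all function names are hypothetical, not the paper's actual implementation.

```python
# Hypothetical sketch: aligning gloss time intervals with per-frame keypoints.
# Data layout and frame rate are assumptions, not taken from the paper.

FPS = 30  # assumed video frame rate


def frames_for_segment(start_ms, end_ms, fps=FPS):
    """Return the frame indices covered by a gloss time interval."""
    first = int(start_ms * fps / 1000)
    last = int(end_ms * fps / 1000)
    return list(range(first, last + 1))


def combine(gloss_segments, keypoints):
    """Pair each gloss with the keypoint sequences of its frames.

    gloss_segments: list of (start_ms, end_ms, gloss) tuples
    keypoints: list indexed by frame; each entry is a list of (x, y) landmarks
    """
    samples = []
    for start_ms, end_ms, gloss in gloss_segments:
        idx = [i for i in frames_for_segment(start_ms, end_ms)
               if i < len(keypoints)]
        samples.append({"gloss": gloss,
                        "frames": idx,
                        "landmarks": [keypoints[i] for i in idx]})
    return samples


# Toy example: one gloss spanning 0-100 ms over dummy single-point keypoints.
kp = [[(0.5, 0.5)] for _ in range(10)]
result = combine([(0, 100, "HOUSE")], kp)
```

Each resulting sample bundles a gloss label with the landmark sequence of its frames, which is the form a sign recognition model would consume.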