Nautical Crime Investigation Services - Problem 2
Overview
NCIS is developing software to identify, track, and build profiles for vessels involved in incidents at sea. The software will use a large language model to collect and analyze publicly available web data. Working from a predefined list of incident categories, the algorithm will gather relevant information about vessels and their reported incidents. It will filter this data to focus solely on incident reports, extracting key details such as vessel name, flag state, and company for every vessel involved in a report. This information will be used to create comprehensive profiles of the reported vessels.
Proposed Problem
This project aims to develop a strategy for defining performance indicators that compare how well different search engines yield high-quality results for web scraping.
Note that this problem is independent of the other problem statements submitted by NCIS.
Context for quality indicators
For each assigned topic, NCIS will have 100+ search engine prompts intended to yield relevant incident reports, and the returned results will be web-scraped. For the scope of this problem, assume the top 20 results returned for each prompt will be scraped, with no sites excluded in advance. The challenge is determining which search engine yields the best results.
To define the quality of search engine results, several factors must be considered. Some suggestions include topical bias and clustering, the novelty of the returned sources, relevance to the assigned topic, and search engine volatility.
Assigned topic: underreporting/misreporting of catch
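Two of the suggested factors, clustering and novelty, can be approximated from the result URLs alone. The following is a minimal sketch, not a prescribed method: it assumes clustering is measured as domain concentration (a Herfindahl index over result domains) and novelty as the share of URLs not already collected. The function names and the choice of these two measures are illustrative assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_concentration(urls: list[str]) -> float:
    """Herfindahl index of domain shares (assumed proxy for clustering).

    Ranges from 1/len(urls), when every result comes from a different
    domain, up to 1.0, when all results come from a single domain.
    """
    if not urls:
        return 0.0
    counts = Counter(domain_of(u) for u in urls)
    total = len(urls)
    return sum((c / total) ** 2 for c in counts.values())


def novelty(urls: list[str], seen: set[str]) -> float:
    """Fraction of URLs that are not in the set of previously seen URLs."""
    if not urls:
        return 0.0
    return sum(1 for u in urls if u not in seen) / len(urls)
```

A lower concentration score suggests an engine draws on a wider range of sources, while a higher novelty score suggests it surfaces pages that earlier prompts or other engines did not.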
Baseline Problem Statement
Develop a strategy for evaluating different search engines on the quality of the yielded results.
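One possible starting point for such a strategy is to score relevance per prompt and then average per engine. The sketch below assumes each scraped result has already been labeled relevant or not relevant to the assigned topic (for example by manual review or a text classifier, neither of which is specified here), and uses precision at k as the score. The data layout and function names are hypothetical.

```python
from statistics import mean


def precision_at_k(labels: list[bool], k: int = 20) -> float:
    """Share of the top-k results that were labeled relevant."""
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0


def rank_engines(
    judgments: dict[str, dict[str, list[bool]]], k: int = 20
) -> list[tuple[str, float]]:
    """Rank engines by mean precision@k across prompts, best first.

    `judgments` maps engine name -> prompt -> relevance labels in rank
    order. This layout is an assumption made for the example.
    """
    scores = {
        engine: mean(precision_at_k(labels, k) for labels in per_prompt.values())
        for engine, per_prompt in judgments.items()
        if per_prompt  # skip engines with no judged prompts
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Precision at k ignores the order of results within the top k; a rank-sensitive measure could be substituted if position matters for the scraping budget.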
Extended Problem Statement (time permitting)
Develop a method to manage search engine volatility and evaluate the performance of search engine results for a given prompt over time, ensuring consistent and high-quality data collection.
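Volatility for a single prompt can be estimated by comparing the result lists returned on different dates. As a rough sketch, and assuming volatility is defined as one minus the mean Jaccard similarity between consecutive top-k URL sets (one of several reasonable definitions), it might look like this. The names are illustrative.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def volatility(snapshots: list[list[str]], k: int = 20) -> float:
    """One minus the mean Jaccard similarity of consecutive snapshots.

    `snapshots` holds the ranked result URLs for one prompt, ordered by
    collection date. Returns 0.0 (stable) to 1.0 (complete turnover).
    """
    if len(snapshots) < 2:
        return 0.0
    sims = [
        jaccard(set(prev[:k]), set(curr[:k]))
        for prev, curr in zip(snapshots, snapshots[1:])
    ]
    return 1.0 - sum(sims) / len(sims)
```

This set-based measure treats a reordering within the top k as no change; tracking rank shifts as well would require a rank-aware comparison.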
Skills
Required
- NLP techniques and tools for text analysis
- Proficiency in web scraping techniques and tools
Preferable
- Familiarity with search engine optimization (SEO) principles