Channel: Active questions tagged selenium - Stack Overflow

Create code to assign keywords to a company/startup in Python


I have created a dataframe with a list of companies, and I have assigned several keywords (from now on: tags) to each company. The dataframe looks like this:

https://res.cloudinary.com/eutop/image/upload/v1570743879/samples/Picture1_u76vn4.png

The first column, "Company ID", holds a unique identifier for each company. The "Tags" column holds the tags associated with that company. For instance, the company with ID 1 has the following tags: opf, oxidation, resource efficiency, textiles.

As of today, the dataframe contains 1,000 companies and 15,000 tags.

I populated this dataframe manually: for each company, I browsed its website, looked for keywords to use as tags, and added them to the dataframe. Although this approach is fairly accurate, it is very time-consuming, and I have to repeat it every time I add a new company. I want to automate it.

My idea is the following:

  1. Open the website of a new company (one not among the 1,000 in my original dataframe) with Selenium WebDriver from Python.

  2. Collect all the available text from the home page and the other pages of the website and concatenate it into a single string (from now on: string_).

  3. Loop over the 15,000 tags already in my dataframe and check whether each one occurs in string_. If it does, I keep the tag; otherwise I discard it.
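To make step 3 concrete, here is a minimal sketch of the matching part only. It assumes string_ has already been collected (the Selenium part is not shown), and the function name match_known_tags and the sample text are made up for illustration. It matches whole words or phrases rather than raw substrings, so a short tag like "opf" does not match inside a longer word.

```python
import re

def match_known_tags(page_text, known_tags):
    """Return the known tags that occur in page_text as whole words or phrases."""
    # Normalise case and whitespace so "Resource  Efficiency" matches "resource efficiency".
    text = " ".join(page_text.lower().split())
    found = []
    for tag in known_tags:
        normalized = " ".join(tag.lower().split())
        if not normalized:
            continue  # skip empty tags
        # The lookarounds stop a tag from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(normalized) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(tag)
    return found

# Hypothetical text standing in for the concatenated pages of one website (string_).
string_ = "We improve resource  efficiency in Textiles through advanced oxidation."
known_tags = ["opf", "oxidation", "resource efficiency", "textiles", "solar"]
print(match_known_tags(string_, known_tags))
```

With 15,000 tags, one regex search per tag for each website should still be workable, but that is an assumption I have not measured.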

This approach has a major drawback: how can I teach my code to find new keywords that are not already among the 15,000 existing tags in my dataframe?

As it stands, I would lose every keyword on the website that is not among my current 15,000 tags. A partial workaround is to also include the synonyms of my tags (something like this: How to get synonyms from nltk WordNet Python), but for me this is not enough.

I also know that there are Python libraries that extract keywords from text, but string_ is not structured text, just chunks of text concatenated together.
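To illustrate what I mean by finding new keywords in such unstructured text, here is a naive sketch that only counts frequent words and two-word phrases that are not already known tags. The function name candidate_tags, the tiny stopword list, and the top_n parameter are all assumptions for illustration; this is not a real keyword-extraction method, and word pairs can span sentence or chunk boundaries.

```python
import re
from collections import Counter

# Small illustrative stopword list; a realistic one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to",
             "we", "our", "is", "are", "with", "on"}

def candidate_tags(page_text, known_tags, top_n=5):
    """Return the most frequent words and two-word phrases that are not known tags."""
    # Keep lowercase alphabetic tokens of at least two characters (hyphens allowed).
    words = re.findall(r"[a-z][a-z\-]+", page_text.lower())
    known = {" ".join(t.lower().split()) for t in known_tags}
    counts = Counter()
    # Single words that are neither stopwords nor known tags.
    counts.update(w for w in words if w not in STOPWORDS and w not in known)
    # Adjacent word pairs with no stopword on either side.
    for first, second in zip(words, words[1:]):
        phrase = first + " " + second
        if first not in STOPWORDS and second not in STOPWORDS and phrase not in known:
            counts[phrase] += 1
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical text standing in for string_.
sample = "Biogas plants. Biogas plants and biogas storage for textiles."
print(candidate_tags(sample, ["textiles"], top_n=3))
```

The candidates would still need a manual review before being added to the dataframe, which is part of why this alone does not solve my problem.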

Can you please advise which approach you would adopt to solve this problem?

In the end, I want to write code that can extract tags both from the 15,000 keywords already in my dataframe and from new keywords that are not yet included.

