Dynamic Enterprise Skills Taxonomy Generation
Author(s): Rakesh Pimplikar, Anuradha Bhamidipaty, Joydeep Mondal, Ankita Gupta, Ruchi Mahindru, Gen Lauer, Radha Ratnaparkhi
Full text: submitted version
Abstract: We present a novel system to generate an enterprise skills taxonomy for an organization. The taxonomy dynamically adapts to changes in the organizational environment. Changes could be due to the acquisition of new skills by employees, the hiring of new employees, attrition of employees, or shifts in the focus of the organization’s strategy. Our approach involves crawling millions of public (technical) documents authored by the employees, and combining them with the organization’s internal data (such as employee HR data). In the process, skills are extracted and assigned to employees. Our approach avoids the problems of manually curated taxonomies, which are difficult to generate and to keep up to date. We also propose an approach to customize the skills taxonomy for the specific enterprise at hand by utilizing a domain-specific seed taxonomy. Our system has considerable potential to impact various workforce-related applications such as analyzing expertise gaps and employee development.
Keywords: Named-Entity Recognition; Asymmetric Similarity; Enterprise Skills Taxonomy
Review 1 (by Sabrina Kirrane)
This paper proposes an approach that can be used to automatically generate skill taxonomies for enterprises, based on the extraction of information from a combination of internal and external documents. Although the work is interesting and the evaluation over real-world data looks promising, the paper focuses primarily on the technical details without discussing the pros and cons of the proposed approach. It would also be good to know more about the challenges encountered, especially with respect to the authoritativeness of the data sources and the suitability of the approach across different domains. For example, how representative are the data sources in terms of the self-professed skills of the employees? Also, which data sources are considered good sources for automatic skill extraction and which are not? In addition, given that domain knowledge (i.e. seed taxonomies, domain word lists, part-of-speech tags, character n-grams, and capitalization patterns) plays a major role in the automatic extraction of topics, it would be useful to provide some background details on how these were constructed. The paper concludes by stating that the proposed approach was applied in a variety of enterprises; however, no information is provided as to the degree of reuse across domains or whether any domain-specific challenges were encountered. Finally, it is worth noting that, as far as I can tell, the paper does not employ semantic technologies in the traditional sense (i.e. RDF, ontologies, etc.), but rather focuses on NLP and machine learning. Other comments: -The paper contains numerous grammatical errors, for example “Outline of the paper” -> The outline of the paper; “To enable users have a common understanding” -> To enable users to have a common understanding ****** Many thanks for the clarifications provided in the rebuttal. If the paper is accepted, it would be beneficial from a reader's perspective to include this additional information.
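[Editorial note] For readers unfamiliar with the feature types the reviewer lists (domain word lists, character n-grams, capitalization patterns), a minimal sketch of the kind of token features a CRF-based skill extractor might use. All names, the seed word list, and the feature choices are hypothetical illustrations, not taken from the paper under review.

```python
DOMAIN_WORDS = {"java", "hadoop", "kubernetes"}  # illustrative seed word list


def char_ngrams(token, n=3):
    """Character n-grams over a padded token, so prefixes/suffixes are captured."""
    padded = f"^{token.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def token_features(tokens, i):
    """Feature dict for the i-th token, in the style commonly fed to a CRF."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),   # capitalization pattern
        "is_all_caps": tok.isupper(),
        "in_domain_list": tok.lower() in DOMAIN_WORDS,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    for g in char_ngrams(tok):                 # character n-gram features
        feats[f"ng={g}"] = True
    return feats
```

In a real pipeline these dictionaries would be passed to a CRF trainer (e.g. a crfsuite-style library) together with BIO labels; part-of-speech tags would typically be added from an external tagger.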
Review 2 (by anonymous reviewer)
Still I think that not mentioning nor referring to semantic web technologies is a major drawback of this paper. The authors do not refer to this critique in their rebuttal. Additionally, it is always very difficult to estimate the quality of the work if the underlying dataset and the associated evaluation lack transparency due to confidentiality. Given these drawbacks I stay with my score. --- The paper describes and evaluates a dynamic enterprise skills taxonomy generator whose purpose is to dynamically identify existing and evolving skills in large organisations. The system uses internal and external documents from which skills and skill sets are calculated, which in turn are used as tags to classify employees according to their expertise. The system applies co-occurrence probabilities to find directional relationships between skill pairs, based on a seed taxonomy that captures the specific context in which the system is being deployed. The system is complemented by a revision mechanism to dynamically capture occurring changes in skill sets and data sources. As discussed in section 2 of the paper, the idea of utilizing semantic techniques like NER, taxonomies, and corresponding similarity measures for skills management and HR purposes is not really new. The authors' claim of innovativeness stems from the argument that their system significantly reduces the time and effort needed to build, curate, and customize skills taxonomies manually. Additionally, they argue that their system is well suited for purposes like identifying expertise gaps and employee development. Bearing that in mind, the paper describes the applied methodologies in a consistent and understandable manner, but the application focus gets blurred throughout the paper. Neither of the two mentioned application areas ever appears throughout the paper again. The evaluation seems to generate promising results but has to be considered very superficial and incomplete.
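[Editorial note] The asymmetric co-occurrence measure referred to above can be sketched minimally as a conditional probability estimated over a document corpus: sim(a → b) = P(b | a) = (docs containing both a and b) / (docs containing a). A large gap between sim(a → b) and sim(b → a) is then taken as evidence of a directional (e.g. narrower-to-broader) relationship. The corpus and skill names below are invented for illustration; this is not the paper's exact formula.

```python
def asymmetric_similarity(docs, a, b):
    """Estimate P(b | a) over a corpus of documents given as term sets."""
    docs_with_a = [d for d in docs if a in d]
    if not docs_with_a:
        return 0.0
    both = sum(1 for d in docs_with_a if b in d)
    return both / len(docs_with_a)


# Toy corpus: "deep learning" nearly always co-occurs with
# "machine learning", but not vice versa -- evidence that it sits below it.
docs = [
    {"machine learning", "deep learning"},
    {"machine learning", "statistics"},
    {"deep learning", "machine learning"},
    {"machine learning"},
]
```

Here sim("deep learning" → "machine learning") is 1.0 while the reverse is 0.5, so the asymmetry points from the narrower skill to the broader one.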
What I am missing is a validity check done by HR experts actually working with the taxonomy. So far the evaluation compares various scoring techniques, which is interesting but does not really support the paper's claims. Additionally, it should be mentioned that the authors do not refer to any sort of Semantic Web technologies, despite the fact that several techniques described in the paper (i.e. data ingestion, taxonomy management, terminology management) could be well supported by them for the purposes described. While parts of the paper are well written, other parts - especially sections 4.3, 4.4, 4.6, and 5 - contain several typos, grammatical errors, and unclear formulations. In case of acceptance, the paper should definitely be proof-read by a native speaker, especially with respect to the correct use of articles.
Review 3 (by anonymous reviewer)
The paper presents a system to dynamically generate enterprise skills taxonomies. It is motivated by a use case of managing employee skills in enterprise environments, with a large number of employees and dynamically evolving skills. The use case is clearly relevant and in scope for the ESWC In Use track. However, in terms of the contribution, the focus is much more on the algorithms for taxonomy generation rather than the original use case of assigning skills to employees. This is problematic on a number of levels. The contribution and presentation of the algorithms for the taxonomy generation process is in principle more suitable for the research track. However, the degree of innovation is rather limited from a research perspective. The techniques for taxonomy generation are rather standard in the field of information extraction (named entity recognition, taxonomic relation extraction, etc.). There is not much that is specific to extracting skills (as opposed to other kinds of concepts). With regard to the evaluation, there is no comparison against alternative, state-of-the-art techniques. Further, results are difficult to reproduce as information about the evaluation data set is not disclosed. The evaluation is also not done against the originally motivated use case, i.e. assigning skills to employees - only the task of extracting skills (without any relationship to employees) is considered. Overall, it seems very questionable whether this approach works for this use case at all. What kind of documents would be required such that one can usefully detect the skills of the author? Clearly, topics can be extracted, but I would claim that only resumes are representative documents for skills. If I mention the term “project manager” in a document, this does not imply that I have skills in project management; perhaps I am complaining about my project manager, etc.
These aspects are not all covered in the evaluation - overall, for an in-use paper there is a lack of discussion of how the results are really useful and applicable/applied in practice.
Review 4 (by Rinke Hoekstra)
I thank the authors for their responses to my review, and I have carefully considered them. Unfortunately, I don't think the responses address all of my concerns, and in some cases quite the contrary: the related work section is still rather weak, and the responses to my remarks on the "proper tree" vs. "forest" distinction and on incremental updates do not really convince me. I have updated my evaluation accordingly. === This paper discusses and evaluates a system for semi-automatic construction and augmentation of a skills taxonomy in an enterprise setting. The purpose of the taxonomy is to have a more fine-grained view of employee expertise within a very large company. As the authors argue, constructing taxonomies is a very labour-intensive task, which often means that the up-front costs of taxonomy construction are considered too high, undermining the benefits of having a good taxonomy in place. I agree that this is a serious problem, and that semi-automatic taxonomy construction could greatly reduce that cost. The described system extracts skill concepts from a variety of sources, using a co-occurrence measure to identify potential relations between them, and uses Wikipedia to determine the directionality of these relations. The evaluation shows that the system developed by the authors performs adequately for their purposes, but there is no comparison with other existing systems (e.g. from the cited literature), and it is unclear from the paper whether the work has been used in production within IBM. There are several parts of the text where the quality of English could benefit from the eye of a native speaker. **Related Work** The authors argue that other approaches (Zhao et al.) are much more dependent on Wikipedia (i.e. they only use skills described in Wikipedia), but they do not really discuss this difference in detail.
It seems that determining directionality is essential for positioning concepts in the hierarchy: what happens if the identified/extracted skill concepts are not found in Wikipedia? How are they positioned? The authors do describe their approach later on in the paper, but the related work section is not specific enough in making explicit the distinctions with prior work. Merely listing a number of papers [3, 4, 7, 10, 14, 15] is not good enough; it just tells the reader that you read the literature, but not how your work differs from it, or why they fall short in addressing the issues. **System Description** Some of what's going on here is not clear to me, and this is also related to the remark on Zhao et al.'s work. The authors use a CRF to do named entity recognition for skills in external documents (I assume also to learn them), but later the skills seem to be restricted to those already present in the HR records, and then the directionality is determined by co-occurrence in Wikipedia articles. If skills are detected by the NER engine that do not occur in Wikipedia articles, then they are discarded from the system. From this I am led to conclude that the system *is* restricted to only Wikipedia articles, and then only those that describe skills that were already present in the HR records. This approach only works under the assumption that the skill coverage of Wikipedia is "complete" for the use case at hand. That is: the corpus of internal/external documents is only needed to limit the taxonomy generation to the skills found in those documents, and to link the skills to the authors. The asymmetric matrix is only computed based on Wikipedia articles. But... the authors discarded the work of Zhao because of its dependence on Wikipedia. Even though they claim Wikipedia could be replaced with any other KB... doesn't the same hold for Zhao's work, in principle?
The resulting restriction is the same: skills that are not mentioned in Wikipedia do not make it into your taxonomy. Also, the relations between terms/concepts are entirely dependent on Wikipedia. It looks like you only need the corpus to link persons and skills, but not to learn skills. **Incremental Updates** Incremental updates seem to augment the asymmetric similarity matrix, but nothing will be removed from the matrix? How does one update the matrix given new information? How are the scores computed? **Taxonomy Generation** The authors restrict the allowed taxonomy to a proper tree. Why? There may be situations where you don't have enough information to generate a satisfactory proper tree. A forest would be just as good? This is essentially what you're doing by allowing for a dummy category under which the algorithm places all newly generated categories that cannot be positioned under any other category.
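[Editorial note] The dummy-category mechanism the reviewer questions can be sketched as follows: given pairwise directional scores, attach each skill to its best-scoring candidate parent, and attach skills with no parent above a threshold directly under a dummy root, yielding a single proper tree. The scoring function, threshold, and skill names below are invented for illustration; a real implementation would also need to break cycles, which this sketch omits.

```python
def build_tree(skills, score, threshold=0.3):
    """Attach each skill to its best parent by score(child, parent);
    fall back to a dummy root when no candidate clears the threshold.
    Returns a {child: parent} map (cycle handling omitted)."""
    parent_of = {}
    for s in skills:
        candidates = [(score(s, p), p) for p in skills if p != s]
        best_score, best = max(candidates) if candidates else (0.0, None)
        parent_of[s] = best if best_score >= threshold else "<ROOT>"
    return parent_of


# Illustrative directional scores (all unlisted pairs score 0.0).
scores = {
    ("deep learning", "machine learning"): 0.9,
    ("machine learning", "deep learning"): 0.2,
}
tree = build_tree(
    ["machine learning", "deep learning", "pottery"],
    lambda c, p: scores.get((c, p), 0.0),
)
```

With these toy scores, "deep learning" attaches under "machine learning", while "machine learning" and "pottery" (no strong parent) land under the dummy root, which is exactly the forest-as-tree behaviour the reviewer describes.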
Review 5 (by Anna Tordai)
This is a metareview for the paper that summarizes the opinions of the individual reviewers. The paper describes a system for automatic taxonomy generation, used to generate an enterprise skills taxonomy, which is an interesting domain for this track. Reviewers note that the HR use case is not referred to in the main parts of the paper, in particular with respect to the evaluation. In the rebuttal the authors clarify that information about the application of the taxonomy by HR cannot be shared due to confidentiality. On the one hand, we do not want to prevent papers from being submitted if the authors are not allowed to share parts of the information. On the other hand, lack of transparency does lower the value of such work for the community. Reviewers describe the evaluation as shallow and incomplete, in particular as the system does not appear to be particularly tailored to this specific domain and could have been compared to the state of the art. Reviewers also note that there is no explicit mention of semantic technology. Laura Hollink & Anna Tordai