How to find duplicate words in Java String? [Solution]

Problem: Write a Java program to print the duplicate words in a given sentence. For example, if the given String is "Java and JavaScript are totally different, JavaScript follows Java", your program should print "Java" and "JavaScript", because those are the two words that appear more than once. You need to consider all cases: the given String can be null or empty, and it may or may not contain duplicate words. For simplicity, you can assume that the sentence is always in English and uses only ASCII letters and numerals, with no special characters. It's better to get the requirements right at the beginning, even if the interviewer doesn't tell you everything. Jumping straight into a solution without asking a couple of questions may not go down well with interviewers who look for detail-oriented candidates.

If you are practicing these coding problems for an interview, I also suggest you take a look at Cracking the Coding Interview book. It contains 150 Programming Questions and their Solutions, which is good enough to clear most of the beginner and intermediate programming job interviews.

How to find duplicate word in String


Solution: To find duplicate words, we first need to divide the sentence into words. You can split the String on whitespace using a greedy regular expression such as "\\s+", which treats a run of several spaces between words as a single separator. The split() method of the java.lang.String class does this and returns an array of words.
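As a rough illustration of the splitting step, here is a minimal sketch. The class name SplitWordsDemo and the method splitIntoWords() are hypothetical names used only for this example, and the call to trim() is an added assumption: it stops split() from returning an empty first token when the sentence starts with spaces.

```java
import java.util.Arrays;

// Hypothetical helper class, used only to illustrate the splitting step.
class SplitWordsDemo {

    // Splits a sentence on runs of whitespace. trim() (an assumption added
    // here) removes leading and trailing spaces so that split() does not
    // return an empty first token.
    static String[] splitIntoWords(String sentence) {
        if (sentence == null || sentence.trim().isEmpty()) {
            return new String[0];
        }
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = splitIntoWords("  Java   and  JavaScript ");
        System.out.println(Arrays.toString(words)); // prints [Java, and, JavaScript]
    }
}
```

Without trim(), a sentence with leading spaces would produce an empty String as its first element, which is harmless for duplicate detection but easy to overlook.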

Once we have the list of words, we can insert them into a HashSet. Since a HashSet doesn't allow duplicates and its add() method returns false if the object already exists in the set, we can find all duplicate words. Just loop over the array, insert each word into the HashSet using add(), and check the return value. If add() returns false, the word is a duplicate; print it to the console.

This is also one of the top 20 String-based problems from interviews. You can see that article for more coding problems based on String.

One of the follow-up questions is: how do you find the number of times each duplicate word appears in a sentence? For example, in our coding problem, your solution should also print the count for both Java and JavaScript, e.g. Java : 2 and JavaScript : 2, because each appears twice in the sentence.


You can solve this problem by choosing another hash-based data structure such as a hash table, which maintains key-value pairs. Java provides several implementations of the hash table data structure, e.g. HashMap, Hashtable, and ConcurrentHashMap, but for general-purpose use HashMap is good enough.

In short, just use a HashMap instead of a HashSet to keep a count of each word in the sentence. This is similar to the problem of finding duplicate characters in a String; instead of characters, you need to find duplicate words.
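The counting variant could be sketched roughly as follows. This is one possible approach rather than a definitive one: the class name DuplicateWordCounter and the method duplicateWordCounts() are hypothetical, the code assumes Java 8 or later for Map.merge(), and it uses LinkedHashMap (a subclass of HashMap) only so that the printed order matches the order in which words first appear.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name, used only for this illustration.
class DuplicateWordCounter {

    // Returns each word that occurs more than once, mapped to its count.
    static Map<String, Integer> duplicateWordCounts(String input) {
        Map<String, Integer> duplicates = new LinkedHashMap<>();
        if (input == null || input.trim().isEmpty()) {
            return duplicates;
        }

        // First pass: count every word. merge() stores 1 for a new word
        // and adds 1 to the existing count otherwise.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : input.trim().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }

        // Second pass: keep only the words seen more than once.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.put(entry.getKey(), entry.getValue());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String test = "Java and JavaScript are totally different, JavaScript follows Java";
        System.out.println(duplicateWordCounts(test)); // prints {Java=2, JavaScript=2}
    }
}
```

A plain HashMap works just as well if you do not care about the order of the output.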

Another follow-up question related to this problem is: how do you remove duplicate words from a String in Java? This is essentially the same problem as removing duplicate elements from an array. If you know how to solve that, you can easily solve this one as well.
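One possible way to remove duplicate words, sketched under a few assumptions: the class name DuplicateWordRemover and the method removeDuplicateWords() are hypothetical, the code assumes Java 8 or later for String.join(), and it swaps in a LinkedHashSet so that the first occurrence of each word is kept in its original position.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name, used only for this illustration.
class DuplicateWordRemover {

    // Keeps the first occurrence of each word and drops later repeats.
    static String removeDuplicateWords(String input) {
        if (input == null || input.trim().isEmpty()) {
            return "";
        }
        // LinkedHashSet ignores repeated words and remembers insertion order.
        Set<String> unique = new LinkedHashSet<>();
        for (String word : input.trim().split("\\s+")) {
            unique.add(word);
        }
        // Join the surviving words back into a sentence with single spaces.
        return String.join(" ", unique);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicateWords("one two one three two")); // prints one two three
    }
}
```

Note that this sketch also collapses multiple spaces into one, which may or may not be what the interviewer wants.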



Java Program to find duplicate words in String

Here is our solution to the problem of finding duplicate words in a sentence in Java. I have used a HashSet to find duplicates. The time complexity of this solution is O(n), where n is the number of words, because we need to iterate over every element in the array. You also need a buffer that can grow to the same size as the original array, so the space complexity is also O(n), which may make this approach unsuitable for a really long String. You need more memory to find even a single duplicate word if your String is huge.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Java Program to demonstrate how to find duplicate words in String.
 */
public class DuplicateWordsInString{

    public static void main(String[] args) {
        String test = "This sentence contains two words, one and two";
        Set<String> duplicates = duplicateWords(test);
        System.out.println("input : " + test);
        System.out.println("output : " + duplicates);
    }


    /**
     * Method to find duplicate words in a Sentence or String
     * @param input String 
     * @return set of duplicate words
     */
    public static Set<String> duplicateWords(String input){
        
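        // null or empty input has no words, hence no duplicates
        if(input == null || input.isEmpty()){
            return Collections.emptySet();
        }
        Set<String> duplicates = new HashSet<>();
        
        // split on one or more whitespace characters
        String[] words = input.split("\\s+");
        Set<String> set = new HashSet<>();
        
        for(String word : words){
            // add() returns false if the word is already in the set
            if(!set.add(word)){
                duplicates.add(word);
            }
        }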
        return duplicates;
    }
    
    
}

Output :
input : This sentence contains two words, one and two
output : [two]

From the output, it's clear that our program is working as expected: it correctly prints "two" as the only duplicate word in the given String. Nonetheless, we are going to write some unit tests to further check our solution against different input values.


JUnit tests

Here is my JUnit test class for our solution. We are going to test our solution with an empty String, a null String, a String with only duplicates, a String without any duplicates, and a String that contains multiple spaces between words. Each JUnit test checks one input. If your input set is large, you can also consider using a parameterized JUnit test.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.Collections;
import java.util.Set;

import org.junit.Test;

public class DuplicateWordsInStringTest {    
   
    @Test
    public void testWithEmptyString(){        
        Set<String> output = DuplicateWordsInString.duplicateWords("");
        assertEquals(Collections.emptySet(), output);
    }
    
    @Test
    public void testWithNullString(){
        Set<String> output = DuplicateWordsInString.duplicateWords(null);
        assertEquals(Collections.emptySet(), output);
    }
    
    @Test
    public void testWithDuplicateString(){
        Set<String> output = DuplicateWordsInString.duplicateWords("one one one two two");
        assertTrue(output.contains("one"));
        assertTrue(output.contains("two"));
        assertEquals(2, output.size());
    }
    
    @Test
    public void testWithOutDuplicates(){
        Set<String> output = DuplicateWordsInString.duplicateWords("one two three");
        assertEquals(Collections.emptySet(), output);
    }
    
    @Test
    public void testWithMultipleSpaceBetweenWord(){
        Set<String> output = DuplicateWordsInString.duplicateWords(" one   two    three ");
        assertEquals(Collections.emptySet(), output);
    }
    
    
}


That's all about how to find duplicate words in a given String in Java. We used the HashSet data structure to solve this problem, and our solution has time and space complexity of O(n). For the curious developer: can you come up with a solution with better time or space complexity? How about a solution with time complexity on the order of O(k), where k is the number of duplicate words, or O(log N)?

Recommended books for Coding Interviews
  • Coding Puzzles: Thinking in code
  • Algorithms For Interviews by Adnan Aziz and Amit Prakash
  • Cracking the Coding Interview: 150 Programming Questions and Solutions 
