字符串的最短唯一前缀

输入第一行为一个整数,表示字符串的行数,接下来的 N 行每行为一个字符串。

输入:

1
2
3
4
5
6
5
bytedance
toutiaohao
toutiaoapp
iesaweme
iestiktok

输出:

1
2
3
4
5
b
toutiaoh
toutiaoa
iesa
iest

题目描述:找到输入中每个字符串的最短唯一前缀。

数据范围:N ≤ 1000,总字符数 < 1000000。

输入数据确保没有一个字符串是另一个字符串的前缀。

思路是每个字符串和其他的所有字符串做比较,记录两个字符串间相同的位数,取其中的最大值作为字符串的唯一标识。

例如:

1
2
3
4
5
6
7
8
9
10
11
12
13
word = 'bytedance'

other_words = [
'toutiaohao',
'toutiaoapp',
'iesaweme',
'iestiktok',
]

# b != t
# b != i
# loc = 0
print('bytedance'[:loc+1])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
word = 'toutiaohao'

other_words = [
'bytedance',
'toutiaoapp',
'iesaweme',
'iestiktok',
]

# for 'toutiaoapp':
# t == t
# o == o
# ...
# a == a
# o == o
# h != a
# loc = 7
print('toutiaohao'[:loc+1])

按照这种思路写出代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
def process(str_list):
for word in str_list:
# one char must be unique
if len(word) == 1:
print(word)
continue

# save the largest i for where the unique prefix ends
saved_i = None
for other_word in str_list:
# not do self compare
if word == other_word:
continue
i = 0
# find the first different char's location between word and other_word
while i < len(other_word):
if word[i] == other_word[i]:
i += 1
else:
# find the first different char
if not saved_i:
# for the first inner loop
saved_i = i
else:
# update saved_i if current i is larger
saved_i = max(i, saved_i)
break
# if other_word is a sub-string of word (this will not happed assured by the problem)
# i will be len(other_word)
# and the else part will never be reached
# which means saved_i still is None
# so check this case
if i == len(other_word):
saved_i = i

print(word[:saved_i+1])
continue


if __name__ == "__main__":
# get toutal input count
N = int(input())

if N <= 0:
exit()

str_list = []
for i in range(N):
str_list.append(input().strip()) # read a word

if N == 1:
print(str_list[0][0])
else:
process(str_list)

这个解法的复杂度非常高,每个字符串都会和其他所有的字符串比较一次,总共需要比较 $N\timesN-1$ 次,每次比较都需要顺序遍历字符串,因此复杂度为 $O(N^2\timesM)$,M 为字符串的平均长度。

后来问了一个同学,她的思路是按照列来处理,使用一个 index 来记录从 0 到当前列的字符串是否重复出现,例如:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
str_list = ['abc', 'calls', 'called']

# 初始化 index
index = [-1] * len(str_list) # index -> [-1, -1, -1]

# 第一列
col = {} # prefix: line_num
col_num = 0
row_num = 0
'a'
col = {'a': 0}
index[row_num] = col_num # index -> [0, -1, -1]
row_num = 1
'c'
col = {'a': 0, 'c': 1}
index[row_num] = col_num # index -> [0, 0, -1]
row_num = 2
'c'
col = {'a': 0, 'c': -1} # 前缀 'c' 重复出现了
index[row_num] = col_num # index -> [0, -1, -1]

# 第二列
col = {}
col_num = 1
row_num = 0
index[row_num] = 0 != -1 # 已经找到最短前缀
row_num = 1
'ca'
col = {'ca': 1}
# index -> [0, 1, -1]
row_num = 2
'ca'
col = {'ca': -1} # 前缀 'ca' 重复出现
# index -> [0, -1, -1]

# ...
# 第五列
col_num = 4
row_num = 0
index[row_num] = 0 != -1 # 已经找到最短前缀
row_num = 1
'calls'
col = {'calls': 1}
# index -> [0, 4, -1]
row_nums = 2
'calle'
col = {'calls': 1, 'calle': 2}
# index -> [0, 4, 4]

# index 均不为 -1,找到所有唯一最短前缀

按照这种思路写出代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def process1(self, str_list):
index = [-1 for i in range(len(str_list))] # 每行的前缀index,初始设为-1,若找到前缀则更新为对应值
i = 0
while True:
col = {} # 记录从 0 到 i(当前列数)的前缀
for line_num, word in enumerate(str_list):
if i == len(word) - 1:
index[line_num] = i
if i <= len(word) - 1 and index[line_num] == -1: # 如果当前行没到结尾,并且还未找到前缀
if word[:i+1] in col: # 前缀 word[:i+1] 已经出现过
index[col[word[:i+1]]] = -1
else: #如果没有重复字符,则找到前缀
col[word[:i+1]] = line_num
index[line_num] = i
if -1 not in index:
break
i += 1
for i, key in enumerate(index):
print(str_list[i][:key+1])

这种方法的时间复杂度是 $O(MN)$,其中 M 是最长的唯一前缀长度,N 是字符串数量。空间复杂度是 $O(N)$,使用了一个数组 index 来记录每一行最短前缀的位置。

更新一下字典树(前缀树)的解法:

使用 path 记录经过节点的字母个数,当节点的 path 的值第一次为 1 时经过路径上的字符构成的字符串即为唯一最短前缀。例如以如下输入构建字典树:

1
2
3
bytedance
toutiaohao
toutiaoapp

从图中可以看出,对于 bytedance 来说,根节点后的第一个节点的 path 值就是 1,所以它的最短唯一前缀就是 b。而后两个字符串的前几个字符相同,所以这几个节点记录的路径值为 2。当第一次出现路径值为 1 的节点时,到达其最短唯一前缀。

实现代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
class TrieNode:
def __init__(self):
self.path = 0
self.end = 0
self.maps = [None] * 26


class Trie:
def __init__(self):
self.root = TrieNode()

def insert(self, word):
"""Insert a word into trie tree.

word: str, a word as string to inserted
"""
if not word:
return

node = self.root
node.path += 1 # this is actually the total number of words
for c in word:
# get the index of current char
index = ord(c) - ord('a')
# add a new trie node if current char is first shown
if not node.maps[index]:
node.maps[index] = TrieNode()
# else step into this char's node and update the path count
node = node.maps[index]
node.path += 1

# hit the end of the word, mark it as end
node.end += 1

def search(self, word):
"""Check whether a word is in trie tree.

word: str, a word as string to be searched
"""
if not word:
return

node = self.root
for c in word:
index = ord(c) - ord('a')
if not node.maps[index]:
return False
node = node.maps[index]
return node.end > 0

def get_shortest_unique_prefix(self, word):
"""Get the shortest unique prefix of a given word.

word: str, a word as string to be searched
"""
# this check should be done,
# but i omit it here for convenient,
# okkk... finally i did this.
assert self.search(word) == True, "word is not in trie tree"
# or should i insert it directly ???

node = self.root
shortest_unique_prefix = []
for c in word:
index = ord(c) - ord('a')
shortest_unique_prefix.append(c)
# note that the path of the first node
# which actually is the root node's path
# should not be checked and what we find
# is the node whose path first equals one
node = node.maps[index]
if node.path == 1:
return ''.join(shortest_unique_prefix)
return word


class Solution:
def process_via_trie(self, str_list):
trie = Trie()
for word in str_list:
trie.insert(word)

for word in str_list:
shortest_unique_prefix = trie.get_shortest_unique_prefix(word)
print(shortest_unique_prefix)


if __name__ == "__main__":
# get toutal input count
N = int(input())
solution = Solution()
if not N:
exit()

str_list = []
for i in range(N):
str_list.append(input().strip()) # read a word
if N == 1:
print(str_list[0][0])
else:
solution.process_via_trie(str_list)

然后测试的时候内存超过限制了…使用长度为 26 的列表会造成大量的空间浪费,尝试改成字典或者其他的数据结构应该可以优化空间的消耗。

更新把列表换成字典的解法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
class TrieNode:
def __init__(self):
self.path = 0
self.end = 0
self.maps = {}


class Trie:
def __init__(self):
self.root = TrieNode()

def insert(self, word):
if not word:
return

node = self.root
node.path += 1
for c in word:
if c not in node.maps:
node.maps[c] = TrieNode()
node = node.maps[c]
node.path += 1

node.end += 1

def search(self, word):
if not word:
return

node = self.root
for c in word:
if c not in node.maps:
return False
node = node.maps[c]
return node.end > 0

def get_shortest_unique_prefix(self, word):
assert self.search(word) == True, "word is not in trie tree"
node = self.root
shortest_unique_prefix = []
for c in word:
shortest_unique_prefix.append(c)
node = node.maps[c]
if node.path == 1:
return ''.join(shortest_unique_prefix)
return word

相比于列表,字典只会在需要的时候创建空间,因此可以节省一些空间,但是结果依然超过了最大可用内存。