The Making of Vocaloid
Thousands of people inside a convention center just outside Tokyo watch as a woman turns a handle on an oversized music box. The crowd – along with thousands more streaming the show online – has spent all night at the third annual Nico Nico Choparty waiting to hear a specific computerized voice. After a few more rotations, the music box’s toy jingle is replaced by an electronic voice stuttering out words over cymbal crashes. The audience screams and starts thrusting glow sticks into the air as the holographic image of Hatsune Miku appears onstage.
The turquoise-haired Miku has become the face of Vocaloid, a singing-synthesizer program that allows users to generate vocals through their computers. Outside of Japan, Miku has become bigger than the program she represents. Western media outlets aired news reports about her earliest concerts in Japan, where flabbergasted anchors tried to describe a show starring a hologram. (The 2012 Tupac Coachella show had yet to happen.) Others have embraced her: Miku was invited to open for Lady Gaga on several of her recent North American tour dates, she got remixed by Pharrell, and Hatsune Miku Expos were held in Los Angeles and New York in October. She even performed on the Late Show With David Letterman.
Yet focusing on the cartoon avatar doesn’t do full justice to the impact the program has had in Japan. Despite slow sales at first, Vocaloid became a phenomenon in 2007. Coupled with the growth of online video-sharing sites, musicians and producers developed a new genre that soon boomed beyond the Internet: Artists who once hawked CD-Rs at comics conventions became Japanese chart crashers, traditional pop stars tried to sound like robots, music retailers opened new sections devoted to Vocaloid music, and karaoke chains uploaded hundreds of Vocaloid songs into their libraries. In short, the Vocaloid technology has carved out a massive place in Japanese pop culture, all through computer-generated singing.
Man has long been interested in inanimate objects that can speak. Ancient Roman poet Virgil, Roger Bacon, and Pope Sylvester II were all reputed to own brazen heads – brass devices shaped like human craniums that could purportedly answer questions. The first attempts to replicate the human voice came in 1779, when Russian professor Christian Kratzenstein developed a machine capable of generating the five long vowel sounds (a, e, i, o, u). The next century saw more scientists create their own speaking machines, and in the early 20th century electrical synthesizers improved the quality of generated speech even further.
It wasn’t until 1961, though, that a machine would sing. Using vocoder technology they had developed themselves years earlier, New Jersey-based Bell Laboratories scientists had an IBM 704 computer “perform” the song “Daisy Bell.” Following Bell Labs’ lead, plenty have continued to explore how computers can sing better. “By the end of last century, the most successful and credible synthesis was the aria of the ‘Queen of the Night’ from Mozart’s opera ‘The Magic Flute’ made in 1984 by Yves Potard and Xavier Rodet using the CHANT synthesizer,” explains Jordi Bonada, a senior researcher at the Music Technology Group at Pompeu Fabra University in Barcelona.
Bonada would know. He’s spent a great deal of his career devoted to singing-synthesis programs. Around the time that Bonada joined Pompeu Fabra in 1997, “Yamaha got in contact with us with some ideas about an interesting research project related to voice transformation, [and] that became the seed of something much bigger.” The goal was to make bad singers sound better in the karaoke booth. “The project codename was Elvis and it lasted for two years,” Bonada says. “It never became a product. One reason was that the system was based on spectral morphing techniques, and required a recorded performance by a professional singer for each song.” It was too big an undertaking given the usual thickness of a Japanese karaoke book.
“Nevertheless, after Elvis, we realized that it might be a better idea to record not just a song from a particular singer, but a set of vocal exercises with a great phonetics range, and build a model capable of singing any song,” Bonada says. “And with that in mind, we agreed with Yamaha to start a new research project aimed at creating a singing synthesizer. That’s also when I met Hideki Kenmochi for the first time.”
The Father of Vocaloid
Hideki Kenmochi loves music. Growing up in Shizuoka, he enjoyed the organ while in kindergarten and his mother signed him up for neighborhood piano lessons. But when he turned 10, he stopped. “I didn’t like it anymore,” he laughs. At 16 he took up the violin, a hobby he still pursues. What he devoted himself to on Saturdays during his adolescence, however, is what would eventually earn him the nickname “father of Vocaloid.”
“I used to listen to short-wave radio growing up,” he says, pulling up some photos of magazines devoted to radios he used to read. It served as a gateway to computers for Kenmochi. “I went to a small computer exhibition with a friend, and we tried to make some basic programs but failed. The person next to us, though, taught us how to do it. I couldn’t buy a computer, so I started going to a computer store on Sundays and on holidays and just spent the day trying to make software. I would bring a lunch box.”
Kenmochi joined Yamaha in 1993, working on active noise control projects (for example, noise-cancelling headphones). In March 2000, he found himself part of the joint venture between Yamaha and Pompeu Fabra focused on singing-synthesizer technology. “Basically, most of the research was done by our side in Barcelona, including the development of the core signal processing libraries in C++. The product design and development was done by Yamaha,” says Bonada, explaining the workflow of the project.
In Barcelona, the Pompeu Fabra team had a few starting points to go from, most notably the Elvis project. “One challenge was how to process and transform singer recordings so that it would result in a performance of a given song sounding as natural as possible and providing the feeling of a continuous flow,” Bonada says. “The second challenge was how to process and transform the singer recordings so that it would result in a performance of a given song sounding as natural as possible and providing the feeling of a continuous flow. With that purpose we devised a novel voice model (EpR) which allowed us to transform vocal timbres in a natural manner while preserving subtle details.”
“We talked and talked about what the singing synthesis should be,” Kenmochi says. “We at Yamaha developed the basic framework for the system. The joint venture resulted in a prototype for Vocaloid in March 2002. At the time, it was codenamed Daisy.”
The interface would eventually become easier to use, but the general premise of the software remains the same today as it was in its first phase. Users write lyrics, and can then adjust various aspects of the computer-generated voice, such as pitch or how long specific syllables are held. Today, users can also choose among several styles of vocal delivery. That said, Kenmochi admits that “one style we can’t really do with Vocaloid now is very rough singing. The program assumes you can detect pitch, it’s the basic frequency. But in rough voices, you sometimes can’t. We want to improve that.”
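Kenmochi’s point about pitch can be illustrated with a small sketch. The code below is not Vocaloid’s method, which the article does not describe; it is a generic autocorrelation pitch estimator, and every name in it (make_tone, make_rough, estimate_f0, the 0.5 clarity threshold) is a hypothetical choice made for illustration. It assumes a steady voice can be approximated by a tone with a few harmonics, and uses a faint tone buried in noise as a crude stand-in for a rough voice.

```python
import math
import random


def make_tone(f0_hz, sample_rate, n_samples):
    """Build a steady, voice-like tone: a fundamental plus two weaker harmonics."""
    tone = []
    for i in range(n_samples):
        phase = 2.0 * math.pi * f0_hz * i / sample_rate
        tone.append(
            math.sin(phase) + 0.5 * math.sin(2 * phase) + 0.25 * math.sin(3 * phase)
        )
    return tone


def make_rough(f0_hz, sample_rate, n_samples, seed=0):
    """Crude stand-in for a rough voice (an assumption): a faint tone buried in noise."""
    rng = random.Random(seed)
    tone = make_tone(f0_hz, sample_rate, n_samples)
    return [0.2 * t + rng.gauss(0.0, 1.0) for t in tone]


def estimate_f0(samples, sample_rate, fmin=80.0, fmax=500.0, min_clarity=0.5):
    """Return (f0_hz or None, clarity) from the strongest autocorrelation peak.

    Clarity is the peak autocorrelation divided by the signal energy. A value
    near 1 means the waveform repeats cleanly; a value near 0 means there is
    no stable period to lock onto, so no pitch is reported.
    """
    energy = sum(s * s for s in samples)
    if energy == 0.0:
        return None, 0.0
    n = len(samples)
    best_lag, best_score = None, 0.0
    # Only search lags that correspond to pitches between fmax and fmin.
    for lag in range(int(sample_rate / fmax), int(sample_rate / fmin) + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        score = corr / energy
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None or best_score < min_clarity:
        return None, best_score
    return sample_rate / best_lag, best_score


if __name__ == "__main__":
    rate = 8000
    clean = make_tone(200.0, rate, 2048)
    rough = make_rough(200.0, rate, 2048)
    print("clean:", estimate_f0(clean, rate))
    print("rough:", estimate_f0(rough, rate))
```

With these settings the clean tone comes back with an estimate of 200 Hz and a high clarity score, while the noisy signal falls below the threshold and returns no pitch at all. That is the situation Kenmochi describes: a system built on the assumption that a fundamental frequency can always be found has little to work with when the voice stops repeating cleanly.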
The next step was figuring out how to sell it. “One of the original ideas was for Yamaha itself to sell the software,” says Kenmochi. “But Vocaloid is a singing synthesizer, and what’s important is the singing voice. We could have made our own voice library, but the variety would have been very limited. So we decided to license the technology to third-party companies.”
As everything started clicking into place, the Vocaloid prototype was introduced to the world for the first time in 2003 at the German music trade show Musikmesse. “Originally we wanted to call it Daisy, but we gave up on that one pretty fast,” Kenmochi says with a chuckle. “We had to register the name as a trademark, and Daisy wasn’t happening. Vocaloid wasn’t actually our first candidate after Daisy, either. The first name we wanted to use was...I can’t actually disclose the name...but we searched for it and we were 95 percent [sure] we could use it, but then we searched for the name in Belgium...and it turns out there was a software with a very similar name to our candidate. So we had to scrap that.”
Thankfully, their third choice – Vocaloid – was alright everywhere, including Belgium. Version 1 of Vocaloid became available to the public on March 3, 2004, when British company Zero-G released Leon and Lola, a male and female voice respectively. It would take a little time, though, before the software became huge.
The Voice of Vocaloid
Two women sit in the recording booth waiting for the final stretch of work to start. Voice actor Yu Asakawa and a Crypton Future Media Global Marketing Manager playfully start singing “Happy Birthday” in the style of Marilyn Monroe to John F. Kennedy. In the adjacent studio, five men move around a mixing board. When they stop laughing, Wataru Sasaki says it’s time for the final reading for the English version of Vocaloid.
“Nerd,” Asakawa says, extending her arms as she lets the word roll off her tongue. “Neeeeerd,” she says with a different delivery after receiving further instructions. She runs through several different readings of the word, until Crypton’s director gives her the OK to move onto the next. After a handful, the day’s work is done.
“When I do other recording jobs, like for anime or video games, you are actually acting. I sometimes have to be passionate or sad, or I have to yell,” Asakawa says afterwards. “With Vocaloid, we always have to do the same tone. It’s difficult. I can’t go out drinking the day before recording for Vocaloid, because my throat will get bad.”
Yamaha, in conjunction with Pompeu Fabra, might have developed Vocaloid, but it was a Sapporo-based company that made it a phenomenon. Crypton Future Media created the character Hatsune Miku. And it was Wataru Sasaki who came up with most of the details that turned her into the perfect avatar for the software.
Vocaloid 1 was not a sales powerhouse when it hit stores in 2004. Kenmochi admits the product’s presentation was one reason that Vocaloid’s sales were initially sluggish. He pulls up photos of Zero-G’s original packaging for the Leon and Lola products. They feature nothing but a close-up photo of lips and some text. Of all the Vocaloid 1 releases, Crypton’s experienced the most success. The company designed a character, Meiko, and stuck her on the front of the box.
When the second version of Vocaloid was developed – featuring smoother vocals and an easier-to-use interface – Crypton designed a new character for the next installment. But that wasn’t the only change. “Vocaloid 1’s vocals were based off analytics of the human voice. For Vocaloid 2, we wanted to sample actual human voices,” Sasaki says. That was something that interested Sasaki personally. He grew up loving music that made heavy use of samples, citing DJ Shadow directly while large Stones Throw and Software stickers on his laptop say it indirectly. Sasaki spent his teen years making his own sample-heavy music, and originally landed a job at Crypton working primarily on sample CDs aimed at underground musicians. He ended up playing a large role in shaping Crypton’s Vocaloid 2 releases.
“I designed Hatsune Miku’s voice. I wanted to make it simple... Just a really clear voice,” Sasaki says. He says he had specific voice actors in mind from the earliest stages, and ended up recruiting one of them – Saki Fujita – who had done anime voice work. “The very first recording session we had with Saki went really well – voice recording sometimes takes more than four hours, but she had great concentration.”
One element that has changed since the original Hatsune Miku recording sessions is how the voice actor recites the sounds that end up in the character’s vocal bank. Sasaki says they originally gave the voice actors a script of nonsense phrases. As time went on, though, they restructured and tuned the voice to give it more stretch (or strength).
After creating the vocal bank and getting designs from illustrator Kei Garo, Hatsune Miku (whose name roughly means “first sound future”) was ready to go on sale. Customers were immediately drawn to the character – stores sold out of the Miku software, and Crypton could barely keep up at first. “I was in Antwerp for a conference at the time of its release, presenting about Vocaloid,” Kenmochi says. “I got a call from Sasaki, and he told me ‘Hatsune Miku is selling very well! More than I expected!’ He kept calling me while I was there. He couldn’t believe it.”
Crypton created a character, but their smartest move was leaving Hatsune Miku a blank canvas. Upon her release, some specific information about her was released – her age (16), her height (5.18 feet), and her weight (92.5 lbs). But that was pretty much it. Crypton allowed users to give Miku whatever personality they wanted. It played into an existing (and extremely popular) scene known as the doujin community. The term refers to works of art – historically, comics – that use pre-existing characters to create what amounts to fan fiction. Vocaloid tapped into this market, and extended beyond just music. Visual artists and amateur music video makers were drawn to Hatsune Miku, too. Crypton actively encouraged this character appropriation, creating the “Piapro Character License” allowing users to take Miku’s image and do what they want with it – as long as it isn’t for commercial gain.
The other reason Miku went from branding tool to a household name in Japan was simply down to good timing. Vocaloid 2 was released around the same time the Japanese video sharing site Nico Nico Douga was gaining traction. Imagine YouTube where the comment section is non-toxic but inescapable. (Comments literally scroll over the video as it plays, creating a sense of connection.) Musicians started uploading their original works to the site, and soon a Vocaloid community emerged. This initial scene included outfits such as supercell and Livetune, who would go on to experience mainstream success in the coming years (the latter, anchored by the producer kz, soundtracked a Miku-centric Vocaloid ad for Google Chrome and featured in that Pharrell remix video).
“I found out about Hatsune Miku from Nico Nico,” Vocaloid producer Hachioji-P says. Today, he’s one of the most well-known artists using Vocaloid in Japan, creating electro pop songs built around Miku’s digital delivery. He plays clubs and Nico Nico-sponsored events. “I had been making my own music for a while, club music. But it had all been instrumental, because I didn’t know anyone I could ask to sing over my music.”
Hachioji-P started connecting with other creators online, and soon he was meeting them at real world events like The Voc@loid M@ster, a Vocaloid-only gathering where independent creators can sell their music or art to other fans. At first, he was going to various Vocaloid club events that sprang up as the burgeoning genre grew in popularity. Then he started playing them. “We realized this was taking off when we got to play in big clubs located in Shibuya and Roppongi.”
It kept growing. Music retailers such as Tower Records Japan opened up sections devoted to Vocaloid music in their stores. Convenience store chain Family Mart has run multiple campaigns focused on Hatsune Miku, complete with silly ads. Karaoke system DAM stockpiles hundreds of Vocaloid songs within its nationwide library, many of which are among the most popular choices in Japan. People are so amped to create their own Vocaloid characters and voice libraries that an entire sub-genre has emerged called utau, referring to a freeware version of the software featuring home-made vocal banks.
Vocaloid has even gone high art in recent years. Isao Tomita – one of the first musicians in Japan to acquire a synthesizer – made a symphony starring Hatsune Miku in late 2012, while an opera called “The End” starring her emerged shortly after. And, of course, live shows featuring holographic performers have become huge.
Quite simply, Vocaloid has become big business.
“At first, it was more like a playground,” Hachioji-P says. “Everybody just did whatever they wanted to do. But now, artists are more aware of what will happen if they get enough video views on Nico Nico. They can get famous, like us. It’s more commercialized today, they think about how the audience will react. Before, we just did whatever we wanted to do.”
Even Japanese pop stars have tried to latch themselves on to the Vocaloid bandwagon. Singer Mayu Watanabe is a member of Japan’s best-selling group AKB48, and technically the most popular member based on an election held in June. In 2012, she released an electro pop single called “Hikaru Monotachi.” Hachioji-P was recruited to produce it, and for that song’s video Watanabe looked very similar to a Vocaloid. That was on purpose. “When I made that, I was told ‘make her sound like a Vocaloid.’ I tried, but it was never totally like that... there was always some emotional nuance there. But I thought that was good for a human... a Vocaloid doesn’t have emotion, but that’s what makes that software stand out from an actual human.”
Hachioji-P hints at one of the weird conflicts of the technology. “My original goal was to try to get a voice as perfect as I could,” Dr. Serra, one of the members of the group at Pompeu Fabra, explains. “The reality is very different than that. People are not interested in getting the voice any better than it already is. That’s my feeling. For me, it was very surprising that people like Hatsune Miku because the quality is so different from a regular voice. For us, that’s because it’s not good enough. But they like that sort of robotic type of quality.”
Hachioji-P agrees. “The reason we were attracted to the mechanical voice was because Vocaloid 2 was really limited in terms of what we could do. In our circle, we played around with it as much as we could...and eventually, our tastes leaned more towards the more mechanical style.” Many listeners in Japan agree. Hachioji-P released his second full-length album Twinkle World this past August, and has played many large live events, including Japan’s most popular summer music festival, the Rock In Japan Fest.
So what’s the future of Vocaloid? Yamaha thinks it’s expanding who can use it. The company released Vocaloid 3 near the end of 2011, but the more intriguing development was the Vocaloid programs aimed at non-Japanese users that trickled out afterwards. The very first Vocaloid software, by Zero-G, was in English. But until August 2013, Hatsune Miku was only available in Japanese. Now one can acquire an English version of Miku. And a Chinese, Korean, and Spanish Vocaloid as well.
“We really love Vocaloid and its community, and we wanted to create the first Spanish Vocaloid voices. Unfortunately, Vocaloid is not as popular in Spain or Europe as it is in Japan,” Bonada says. “So far we have created three Spanish voices named Maika, Bruno, and Clara, and even made a live concert where people could move them in real-time with a Kinect based system.”
“Ideally, we need to develop as many languages as possible for Vocaloid. People want to make songs with Vocaloid in their mother tongue... but it’s not possible right now! So we have to start developing them,” says Kenmochi. “Vocaloid is a never-ending story!”
Special thanks to Sena Fujisawa for the translation help. Top image copyright Crypton Future Media INC., www.piapro.net